
Received 19 April 2022; revised 29 June 2022; accepted 30 June 2022. Date of publication 12 July 2022; date of current version 26 July 2022. The review of this article was arranged by Associate Editor Laura Balzano.
Digital Object Identifier 10.1109/OJSP.2022.3190213

An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences

WEI GUO, BENEDETTA TONDI (Member, IEEE), AND MAURO BARNI (Fellow, IEEE)
Department of Information Engineering and Mathematics, University of Siena, 53100 Siena, Italy
CORRESPONDING AUTHOR: WEI GUO (e-mail: [email protected]).
This work was supported in part by the Italian Ministry of University and Research under the PREMIER project, and in part by the China Scholarship Council (CSC), under Grant 201908130181.

ABSTRACT Together with impressive advances touching every aspect of our society, AI technology based
on Deep Neural Networks (DNN) is bringing increasing security concerns. While attacks operating at test
time have monopolised the initial attention of researchers, backdoor attacks, exploiting the possibility of
corrupting DNN models by interfering with the training process, represent a further serious threat under-
mining the dependability of AI techniques. In backdoor attacks, the attacker corrupts the training data to
induce an erroneous behaviour at test time. Test-time errors, however, are activated only in the presence
of a triggering event. In this way, the corrupted network continues to work as expected for regular inputs,
and the malicious behaviour occurs only when the attacker decides to activate the backdoor hidden within the
network. Recently, backdoor attacks have become the subject of intense research, focusing on both the development
of new classes of attacks, and the proposal of possible countermeasures. The goal of this overview is to review
the works published until now, classifying the different types of attacks and defences proposed so far. The
classification guiding the analysis is based on the amount of control that the attacker has on the training
process, and the capability of the defender to verify the integrity of the data used for training, and to monitor
the operations of the DNN at training and test time. Hence, the proposed analysis is suited to highlight the
strengths and weaknesses of both attacks and defences with reference to the application scenarios they are
operating in.

INDEX TERMS Backdoor attacks, backdoor defences, AI security, deep learning, deep neural networks.

I. INTRODUCTION
Artificial Intelligence (AI) techniques based on Deep Neural Networks (DNN) are revolutionising the way we process and analyse data, due to their superior capabilities to extract relevant information from complex data, like images or videos, for which precise statistical models do not exist. On the negative side, increasing concerns are being raised regarding the security of DNN architectures when they are forced to operate in an adversarial environment, wherein the presence of an adversary aiming at making the system fail can not be ruled out. In addition to attacks operating at test time, with an increasing amount of works dedicated to the development of suitable countermeasures against adversarial examples [1], [2], attacks carried out at training time have recently attracted the interest of researchers. In most cases, training time attacks involve poisoning the training data as in [3]–[8]. Defences against such attacks have also been studied in [9]–[12]. Among the attacks operating during training, backdoor attacks are raising increasing concerns due to the possibility of stealthily injecting a malevolent behaviour within a DNN model by interfering with the training phase. The malevolent behaviour (e.g., a classification error), however, occurs only in the presence of a triggering event corresponding to a properly crafted input. In this way, the backdoored network continues working as expected for regular inputs, and the malicious behaviour is activated only when the attacker feeds the network with a triggering input.

The earliest works demonstrating the possibility of injecting a backdoor into a DNN have been published in 2017 [4], [6], [13], [14]. Since then, an increasing number of works have been dedicated to such a subject, significantly enlarging the class of available attacks, and the application scenarios potentially targeted by backdooring attempts. The proposed attacks differ on the basis of the event triggering the backdoor at test time, the malicious behaviour induced by the activation of the backdoor, the stealthiness of the procedure used to inject the backdoor, the modality through which the attacker interferes with the training process, and the knowledge that the attacker has about the attacked network.
As a reaction to the new threats posed by backdoor attacks, researchers have started proposing suitable solutions to mitigate the risk that the dependability of a DNN is undermined by the presence of a hidden backdoor. In addition to methods to reveal the presence of a backdoor, a number of solutions to remove the backdoor from a trained model have also been proposed, with the aim of producing a cleaned model that can be used in place of the infected one [15]–[17]. Roughly speaking, the proposed solutions for backdoor detection can be split into two categories: methods detecting the backdoor injection attempts at training time, e.g. [18], [19], and methods detecting the presence of a backdoor at test time, e.g., [19]–[22]. Each defence targets a specific class of attacks and usually works well only under a specific threat model.
As it always happens when a new research trend appears, the flurry of works published in the early years has explored several directions, with only a few scattered attempts to systematically categorise them. Time is ripe to look at the work done until now, to classify the attacks and defences proposed so far, highlighting their suitability to different application scenarios, and to evaluate their strengths and weaknesses. To the best of our knowledge, only a few papers overviewing backdoor attacks have already been published [23]–[26]; the earliest of such attempts either have a limited scope or focus on a specific attack surface, namely, the outsourced cloud environment. In particular, [26] provides a thorough analysis of the vulnerabilities caused by the difficulty of checking the trustworthiness of the data used to train a DNN, discussing various types of attacks and defences, mostly operating at the training-dataset level. A benchmark study introducing a common evaluation setting for different backdoor and data poisoning attacks, without considering defences, has also been published in [27]. With respect to existing overviews, we make the additional effort to provide a clear definition of the threat models, and use it to classify backdoor attacks by adopting an innovative perspective based on the control that the attacker has on the training process. As to countermeasures, we do not restrict the analysis to defences based on the inspection of the training data (as done by some previous overviews). On the contrary, we also review defences operating at testing time, suitable for scenarios wherein the attacker has full control of the training process and the defender can not access the training data.
To be more specific, the contributions of the present work can be summarised as follows:
• We provide a formalization of backdoor attacks, defining the possible threat models and the corresponding requirements (Section II). A rigorous description of the threat models under which the backdoor attacks and defences operate is, in fact, a necessary step for a proper security analysis. We distinguish between different scenarios depending on the control that the attacker has on the training process. In particular, we propose a novel taxonomy that classifies attacks into i) full control attacks, wherein the attacker is the trainer herself, who, then, controls every step of the training process, and ii) partial control attacks, according to which the attacker can interfere with the training phase only partially. The requirements that attacks and defences must satisfy in the various settings are also described, as they are closely related to the threat models.
• We systematically review the backdoor attacks proposed so far, specifying the control scenario under which they can operate, with particular attention to whether the attacker can corrupt the labels of the training samples or not.
• We provide a thorough review of possible defences, by casting them in the classification framework defined previously. In particular, we propose a novel categorization of defences based on the control that the defender has on the training and testing phases, and on the level at which they operate, that is: i) data level, ii) model level, and iii) training dataset level. The defences within each category are further classified based on the approach followed for the detection and the removal of the backdoor. Thanks to the proposed classification, defence methods can be compared according to the extent to which they satisfy the requirements set by the threat model wherein they operate.
• We point out possible directions for future research, reviewing the most challenging open issues.
To limit the scope and length of the paper, we focus on attacks and defences in the field of image and video classification, leaving aside other application domains, e.g., natural language processing [28], [29]. We also avoid discussing the emerging field of attacks and defences in collaborative learning scenarios, like federated learning [30]–[34]. Finally, we stress that the survey is not intended to review all the methods proposed so far; on the contrary, we describe in detail only the most significant works of each attack and defence category, and provide a pointer to all the other methods we are aware of.
We expect that research on backdoor attacks and corresponding defences will continue to surge in the next years, due to the seriousness of the security threats they pose, and hope that the present overview will help researchers to focus on the most interesting and important challenges in the field.


The rest of this paper is organised as follows: in Section II, we formalize the backdoor attacks, paying great attention to the attack surface and the possible defence points. Then, in Section III, we review the literature of backdoor attacks. Following the categorization introduced in Section II, the defence methods are reviewed and compared in Sections IV through VI, classifying them according to the level (input data, model, or training dataset level) at which they operate. Finally, in Section VII, we discuss the most relevant open issues and provide a roadmap for future research.

II. FORMALIZATION, THREAT MODELS AND REQUIREMENTS
In this section, we give a rigorous formulation of backdoor attacks and the corresponding threat models, paying particular attention to the requirements that the attack must satisfy under different models. We also introduce the basic notation used in the rest of the paper.
We will assume that the model targeted by the attack aims at solving a classification problem within a supervised learning framework. Other tasks and training strategies, such as semantic segmentation [35] or contrastive learning [36], can also be subject to backdoor attacks; however, to avoid expanding too much the scope of the survey, and considering that most of the existing literature focuses on classification networks, we will restrict our discussion to this kind of task. In this framework, the goal of a backdoor attack is to introduce in the network a misbehaviour (a misclassification, in the setting considered in this overview) to be activated at testing time by presenting at the input of the network a specific triggering pattern. The injected backdoor does not affect the classification of benign inputs, but is activated in the presence of a triggering pattern, as shown in Fig. 1, where the backdoored network can successfully classify animal images, unless a 'golden star' (the triggering pattern) is present at the input, in which case the input is always classified as a 'dog'.

FIGURE 1. A backdoored model for 'horse-dog-cat' classification.

A. BASIC NOTATION AND FORMALIZATION
In supervised learning, a classifier F_θ is trained to map a sample x from the input space X into a label y belonging to the label space Y = {1, . . ., C}. Classification is usually (but not necessarily) achieved by:

F_\theta(x) = \arg\max\big(f_\theta(x)\big),   (1)

where f_θ(x) is a C-element vector whose elements represent the probabilities over the labels in Y (or some other kind of soft values), and arg max(·) outputs the index with the highest probability. In the following, we indicate the k-th element of f_θ(x) as [f_θ(x)]_k, and the output of the i-th layer of the network as f_θ^i(x). Here, θ indicates the trainable parameters of the model. F may also depend on a set of hyperparameters, denoted by ψ, defining the exact procedure used to train the model (e.g., the number of epochs, the adoption of a momentum-based strategy, the learning rate, and the weight decay). Unless necessary, we will not indicate explicitly the dependence of F on ψ. F_θ is trained by relying on a training set D_tr = {(x_i^tr, y_i^tr), i = 1, . . ., |D_tr|}, where (x_i^tr, y_i^tr) ∈ X × Y and |D_tr| indicates the cardinality of D_tr. The goal of the training procedure is to define the parameters θ by solving the following general optimization problem:

\arg\min_\theta \sum_{i=1}^{|D_{tr}|} L\big(f_\theta(x_i^{tr}), y_i^{tr}\big),   (2)

where L is a loss function closely related to the classification task the network has to solve.

B. EVALUATION METRICS
At testing time, the performance of the trained model F_θ is evaluated on the elements of a test dataset D_ts = {(x_i^ts, y_i^ts), i = 1, . . ., |D_ts|}. In particular, the accuracy of the model is usually evaluated as follows:

A(F_\theta, D_{ts}) = \frac{\#\{F_\theta(x_i^{ts}) = y_i^{ts}\}}{|D_{ts}|},   (3)

where #{F_θ(x_i^ts) = y_i^ts} indicates the number of successful predictions. On the other hand, to check whether a backdoor has been injected into the model, we evaluate F_θ upon a poisoned test dataset D_ts^p, where all samples x̃^ts from all the classes, with the exception of the target class t, contain the triggering pattern, and are labelled as ỹ^ts = t. The attack success rate is computed as ASR(F_θ, D_ts^p) = A(F_θ, D_ts^p).
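To make these two metrics concrete, the following Python sketch computes the accuracy of (3) and the attack success rate on a poisoned copy of the test set. It is a minimal illustration under stated assumptions, not code from the paper: `model` is any hypothetical callable returning a predicted label, and `trigger_fn` stands for whatever poisoning function P(·, υ) the attacker uses.

```python
import numpy as np

def accuracy(model, X, y):
    """A(F_theta, D) as in (3): fraction of correct predictions."""
    preds = np.array([model(x) for x in X])
    return np.mean(preds == y)

def attack_success_rate(model, X, y, trigger_fn, target_class):
    """ASR: accuracy on a poisoned test set D_ts^p, built by adding the trigger
    to every sample NOT belonging to the target class and relabelling it as t."""
    keep = y != target_class                       # samples of class t are excluded
    X_poisoned = np.array([trigger_fn(x) for x in X[keep]])
    y_poisoned = np.full(X_poisoned.shape[0], target_class)
    return accuracy(model, X_poisoned, y_poisoned)
```

As in the definition above, the samples originally belonging to the target class are excluded when measuring the ASR.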
C. FORMALIZATION OF BACKDOOR ATTACKS
As we briefly discussed in the Introduction, the goal of a backdoor attack is to make sure that, at test time, the backdoored model behaves as desired by the attacker in the presence of specific triggering inputs, while it continues to work as expected on normal inputs. To do so, the attacker interferes with the generation of the training dataset. In some cases (see Section II-D1), she can also shape the training procedure, so as to directly instruct the network to implement the desired behaviour.
Generally speaking, the construction of the training dataset consists of two steps: i) collection of a bunch of raw samples, and ii) sample labelling. During the first step, the attacker injects into the training dataset a set of poisoned samples {x̃_1^tr, x̃_2^tr, . . .}, where each element contains a triggering pattern υ. The shape of the triggering pattern and the exact way the pattern is associated to the poisoned samples depend on the specific attack and will be detailed later. Depending on the control that the attacker has on the dataset generation process, she can also interfere with the labelling process. Specifically, two kinds of attacks are possible.

In a corrupted-label attack, the attacker can directly label x̃_i^tr, while in a clean-label attack, the labelling process is up to the legitimate trainer.
Let us indicate with ỹ_i^tr the label associated to x̃_i^tr. The set with the labeled poisoned samples forms the poisoning dataset D_tr^p = {(x̃_i^tr, ỹ_i^tr), i = 1, . . ., |D_tr^p|}. The poisoning dataset is merged with the benign dataset D_tr^b = {(x_i^tr, y_i^tr), i = 1, . . ., |D_tr^b|} to generate the poisoned training dataset D_tr^α = D_tr^b ∪ D_tr^p, where

\alpha = \frac{|D_{tr}^p|}{|D_{tr}^p| + |D_{tr}^b|},   (4)

hereafter referred to as poisoning ratio, indicates the fraction of corrupted samples contained in the poisoned training dataset.
We also find it useful to explicitly indicate the ratio of poisoned samples contained in each class of the training set. Specifically, let D_tr,k^b (resp. D_tr,k^p) indicate the subset of samples for which y_i^tr = k in the benign (resp. poisoned) dataset. Then, D_tr^b = ∪_k D_tr,k^b (D_tr^p = ∪_k D_tr,k^p). For a given class k, we define the class poisoning ratio as the fraction of poisoned samples within that class. Formally,

\beta_k = \frac{|D_{tr,k}^p|}{|D_{tr,k}^p| + |D_{tr,k}^b|}.   (5)

In the following, when the attacker poisons only samples from one class, or when it is not necessary to indicate the class affected by the attack, the subscript k is omitted.
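As a small worked example of (4) and (5), the snippet below computes the poisoning ratio α and the class poisoning ratios β_k from the label vectors of the benign and poisoned partitions; the arrays and the target class used here are toy assumptions.

```python
import numpy as np

def poisoning_ratios(y_benign, y_poisoned, num_classes):
    """Global poisoning ratio alpha, eq. (4), and per-class ratios beta_k, eq. (5)."""
    alpha = len(y_poisoned) / (len(y_poisoned) + len(y_benign))
    beta = np.zeros(num_classes)
    for k in range(num_classes):
        n_p = np.sum(y_poisoned == k)
        n_b = np.sum(y_benign == k)
        beta[k] = n_p / (n_p + n_b) if (n_p + n_b) > 0 else 0.0
    return alpha, beta

# toy example: 1000 benign labels, 30 poisoned samples all labelled as the target class t = 3
y_b = np.random.randint(0, 10, size=1000)
y_p = np.full(30, 3)
alpha, beta = poisoning_ratios(y_b, y_p, num_classes=10)   # alpha ~ 0.03, beta[3] dominates
```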
Due to poisoning, the classifier F_θ is trained on D_tr^α, and hence it learns the correct classification from the benign dataset D_tr^b and the malevolent behaviour from D_tr^p. By assuming that the attacker does not control the training process, training is achieved by optimizing the same loss function used to train a benign classifier, as stated in the following equation:

\theta_\alpha = \arg\min_\theta \Bigg( \sum_{i=1}^{|D_{tr}^b|} L\big(f_\theta(x_i^{tr}), y_i^{tr}\big) + \sum_{i=1}^{|D_{tr}^p|} L\big(f_\theta(\tilde{x}_i^{tr}), \tilde{y}_i^{tr}\big) \Bigg),   (6)

where, for the sake of clarity, we have split the loss function into two terms, one accounting for the benign samples and the other for the poisoned ones. In the sequel, we denote the backdoored model resulting from the optimization in (6) by F_θα.
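A minimal PyTorch sketch of the optimization in (6): since the attacker is assumed not to control the training process, the backdoored model is obtained by minimizing the ordinary loss over the union D_tr^b ∪ D_tr^p, which is exactly what training on the merged dataset does. The tensors, model, and hyperparameters below are toy placeholders, not the setup of any specific paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# toy stand-ins for D_tr^b (benign) and D_tr^p (poisoned, relabelled as target class t = 7)
x_benign = torch.randn(256, 3, 32, 32); y_benign = torch.randint(0, 10, (256,))
x_poison = torch.randn(16, 3, 32, 32);  y_poison = torch.full((16,), 7)

loader = DataLoader(
    ConcatDataset([TensorDataset(x_benign, y_benign), TensorDataset(x_poison, y_poison)]),
    batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in loader:                    # benign and poisoned samples are mixed together
        loss = criterion(model(x), y)      # the same loss L covers both terms of (6)
        opt.zero_grad(); loss.backward(); opt.step()
```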
To be effective, a backdoor attack must achieve two main goals¹:
• Stealthiness at test time. The backdoor attack should not impair the expected performance of the model. This means that the backdoored model F_θα and the benign one F_θ should have similar performance when tested on a benign testing dataset D_ts^b, i.e., A(F_θα, D_ts^b) ≈ A(F_θ, D_ts^b).
• High attack success rate. When the triggering pattern υ appears at the input of the network, the malevolent behaviour should be activated with a high probability. Therefore, the attack success rate ASR(F_θα, D_ts^p) should be big enough to ensure that the backdoor can be successfully activated.

¹ Other goals depend on the attack scenario as discussed in Section II-D.

A list of the symbols introduced in this section and all the other symbols used throughout the paper is given in Table 1.

TABLE 1. List of Symbols

D. ATTACK SURFACE AND DEFENCE POINTS
The threat model ruling a backdoor attack, including the attack surface and the possible defence points, depends mainly on the control that the attacker has on the training process. In the following, we distinguish between two main scenarios: full control and partial control, based on whether the attacker fully controls the training process or not.

1) FULL CONTROL


FIGURE 2. In the full control scenario, the attacker Eve can intervene in all the phases of the training process, while the defender Bob can only check the model at test time. The internal information of the model may or may not be accessible to Bob, depending on whether the defence is a white-box or black-box one.
FIGURE 3. Backdoor detection.

In this scenario, exemplified in Fig. 2, the attacker, hereafter referred to as Eve, is the trainer herself, who, then, can interfere with every step of the training process. This assumption is realistic in a scenario where the user, say Bob, outsources the training task to a third party due to lack of resources. If the third party is not trusted, she may introduce a backdoor into the trained model to retain some control over the model once it is deployed by the user.
Attacker's knowledge and capability: since Eve coincides with the legitimate trainer, she knows all the details of the training process, and can modify them at will, including the training dataset, the loss function L, and the hyperparameters ψ. To inject the backdoor into the model Eve can:
• Poison the training data: Eve designs a poisoning function P(·) to generate the poisoned samples {x̃_1^tr, x̃_2^tr, . . .} and merges them with the benign dataset.
• Tamper the labels: the labelling process is also ruled by Eve, so she can mislabel the poisoned samples x̃_i^tr to any class ỹ_i^tr.
• Shape the training process: Eve can choose a suitable algorithm or learning hyperparameters to solve the training optimization problem. She can even adopt an ad-hoc loss function explicitly thought to ease the injection of the backdoor [37].
Other less common scenarios, not considered in this paper, may assign to the attacker additional capabilities. In some works, for instance, the attacker may directly change the weights after the training process has been completed [38], [39].
Defender's knowledge and capability: as shown in Fig. 2, in the full control scenario, the defender Bob corresponds to the final user of the model, and hence he can only act at test time. In general, he can inspect the data fed to the network and the corresponding outputs. He may also query the network with untainted samples from a benign test set D_ts^b, which is used to validate the accuracy of the network. Moreover, Bob may hold another benign dataset D_be to aid backdoor detection or removal. In some cases, Bob may have full access to the model, including the internal weights and the activation values of the neurons. In the following, we refer to these cases as white-box defences. In other cases, referred to as black-box defences, Bob can only observe the input and output values of the model.
In general, Bob can adopt two different strategies to counter a backdoor attack: i) detect the presence of the triggering pattern, and/or remove it from the samples fed to the network, ii) detect the presence of the backdoor and/or remove it from the model. In the former case the defence works at the data level, while in the second case, we say that it operates at the model level (a minimal sketch of both strategies is given after the list):
• Data level defences: with this approach, Bob builds a detector whose goal is to reveal the presence of the triggering pattern υ in the input sample x^ts. By letting Det(·) denote the detection function, we have Det(x^ts) = Y/N (see Fig. 3(a)). If Det(·) reveals the presence of a triggering pattern, the defender can directly reject the adversarial sample, or try to remove the pattern υ from x^ts by means of a removal function Rem(·). Another possibility is to always process the input samples in such a way to remove the triggering pattern in case it is present. Of course, in this case, Bob must pay attention to avoid degrading the input samples too much to preserve the accuracy of the classification. Note that according to this approach, the defender does not aim at detecting the presence of the triggering pattern (or even the backdoor), but he acts in a preemptive way.
• Model level defences: in this case Bob builds a model level detector in charge of deciding whether the model F_θ contains a backdoor or not. Then, the detection function is Det(F_θ) = Y/N (Fig. 3(b)). If Det(·) decides that the model contains a backdoor, the defender can refrain from using it, or try to remove the backdoor. The removal function operating at this level generates a cleaned model F_θ^c = Rem(F_θ), e.g., by pruning the model or retraining it [16]. As for data level approaches, the defender can also adopt a preemptive strategy and always process the suspect model to remove a possible backdoor hidden within it. Of course, the alteration should be a minor one to avoid that the performance of the model drops with respect to that of the original, non-altered, model.
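The following sketch summarises the two strategies above as plain Python wrappers; `det_input`, `rem_input`, `det_model`, and `rem_model` are abstract placeholders for whatever Det(·) and Rem(·) functions the defender actually deploys, not functions defined in the paper.

```python
def guarded_predict(model, x, det_input, rem_input=None):
    """Data level defence: inspect (and possibly clean) every input before classification."""
    if det_input(x):                       # Det(x^ts) = Y: trigger detected
        if rem_input is None:
            return None                    # reject the suspicious sample
        x = rem_input(x)                   # or remove the trigger and continue
    return model(x)

def vet_model(model, det_model, rem_model=None):
    """Model level defence: decide whether the model hides a backdoor before deployment."""
    if det_model(model):                   # Det(F_theta) = Y: backdoor suspected
        return rem_model(model) if rem_model is not None else None
    return model                           # F_theta^c, or the original model if judged clean
```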
2) PARTIAL CONTROL
This scenario assumes that Eve controls the training phase only partially, i.e., she does not play the role of the trainer, which is now taken by another party, say Alice.

However, she can interfere with data collection and, optionally, with labelling, as shown in Fig. 4. If Eve cannot interfere with the labelling process, we say that backdoor injection is achieved in a clean-label way, otherwise we say that the attack is carried out in a corrupted-label modality. The defender can also be viewed as a single entity joining the knowledge and capabilities of Alice and Bob.

FIGURE 4. In the partial control scenario, the attacker can interfere with the data collection process, while the possibility of specifying the labels of the poisoned samples is only optional.

Attacker's knowledge and capability: even if Eve does not rule the training process, she can still obtain some information about it, like the architecture of the attacked network, the loss function L used for training, and the hyperparameters ψ. By relying on this information, Eve is capable of:
• Poisoning the data: Eve can poison the training dataset in a stealthy way, e.g. by generating a set of poisoned samples (x̃_1^tr, x̃_2^tr, . . .) and releasing them on the Internet as a bait waiting to be collected by Alice [40].
• Tampering the labels of the poisoned samples (optional): when acting in the corrupted-label modality, Eve can mislabel the poisoned data x̃_i^tr as belonging to any class, while in the clean-label case, labelling is controlled by Alice. Note that, given a target label t for the attack, in the corrupted-label scenario, samples from other classes (y ∈ Y \ {t}) are poisoned by Eve and the poisoned samples are mislabelled as t, that is ỹ_i^tr = t, while in the clean-label scenario, Eve poisons samples belonging to the target class t. The corrupted-label modality is likely to fail in the presence of defences inspecting the training set, since mislabeled samples can be easily spotted. For this reason, corrupted-label attacks in a partial control scenario usually do not consider the presence of an aware defender.
Defender's knowledge and capability: as shown in Fig. 4, the defender role can be played by both Alice and Bob, who can monitor both the training process and the testing phase. From Bob's perspective, the possible defences are the same as in the full control scenario, with the possibility of acting at data and model levels. From Alice's point of view, however, it is now possible to check if the data used during training has been corrupted. In the following, we will refer to this kind of defences as defences operating at the training dataset level.
• Training dataset level: at this level, Alice can inspect the training dataset D_tr^α to detect the presence of poisoned samples and possibly filter them out. To do so, Alice develops a training dataset level detector Det(x^tr) (Fig. 3(c)), which judges whether each single training sample x^tr ∈ D_tr^α is a poisoned sample (Det(x^tr) = Y) or not (Det(x^tr) = N). The detector Det(·) can also be applied to the entire dataset, Det(D_tr^α), to decide if the dataset is globally corrupted or not. Upon detection, the defender may remove the poisoned samples from the training set D_tr^α with a removal function Rem(D_tr^α), and use the clean dataset to train a new model F_θ^c.
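A minimal sketch of the training dataset level strategy just described, under the assumption that a hypothetical per-sample detector `det_sample` is available: every training pair is inspected and the flagged samples are discarded before retraining.

```python
def clean_training_set(dataset, det_sample):
    """Training dataset level defence: run Det(.) on every (x, y) pair and drop the
    samples flagged as poisoned; the surviving set is then used to train F_theta^c."""
    kept, removed = [], []
    for x, y in dataset:
        (removed if det_sample(x, y) else kept).append((x, y))
    return kept, removed

# usage sketch (det_sample and train are placeholders):
# clean_set, suspicious = clean_training_set(train_set, det_sample)
# model_c = train(clean_set)     # retrain on the filtered data
```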
E. REQUIREMENTS
In this section, we list the different requirements that the attacker and the defender(s) must satisfy in the various settings. Regarding the attacker, in addition to the main goals already listed in Section II-C, she must satisfy the following requirements:
• Poisoned data indistinguishability: in the partial control scenario, Alice may inspect the training dataset to detect the possible presence of poisoned data. Therefore, the samples in the poisoned dataset D_tr^p should be as indistinguishable as possible from the samples in the benign dataset. This means that the presence of the triggering pattern υ within the input samples should be as imperceptible as possible. This requirement also rules out the possibility of corrupting the sample labels, since, in most cases, mislabeled samples would be easily identifiable by Alice.
• Trigger robustness: in a physical scenario, where the triggering pattern is added to real world objects, it is necessary that the presence of υ can activate the backdoor even when υ has been distorted due to the analog-to-digital conversion associated to the acquisition of the input sample from the physical world. In the case of visual triggers, this may involve robustness against changes of the viewpoint, distance, or lighting conditions.
• Backdoor robustness: in many applications (e.g. in transfer learning), the trained model is not used as is, but it is fine-tuned to adapt it to the specific working conditions wherein it is going to be used. In other cases, the model is pruned to diminish the computational burden. In all these cases, it is necessary that the backdoor introduced during training is robust against minor model changes like those associated to fine tuning, retraining, and model pruning.
With regard to the defender, the following requirements must be satisfied:
• Efficiency: at the data level, the detector Det(·) is deployed as a pre-processing component, which filters out the adversarial inputs and allows only benign inputs to enter the classifier. Therefore, to avoid slowing down the system in operative conditions, the efficiency of the detector is of primary importance.


For instance, a backdoor detector employed in autonomous-driving applications should make a timely and safe decision even in the presence of a triggering pattern.
• Precision: the defensive detectors deployed at all levels are binary classifiers that must achieve a satisfactory performance level. As customarily done in binary detection theory, the performance of such detectors may be evaluated by means of two metrics: the true positive rate TPR = TP/(TP + FN) and the true negative rate TNR = TN/(TN + FP), where TP represents the number of corrupted (positive) samples correctly detected as such, FP indicates the number of benign (negative) samples incorrectly detected as corrupted, TN is the number of negative samples correctly detected as such, and FN stands for the number of positive samples detected as negative ones. For a good detector, both TPR and TNR should be close to 1 (a minimal computation is sketched after this list).
• Harmless removal: at different levels, the defender can use the removal function Rem(·) to prevent an undesired behaviour of the model. At the model or training dataset level, Rem(·) directly prunes the model F_θα or retrains it to obtain a clean model F_θ^c. At the data level, Rem(·) filters out or cures the adversarial inputs. When equipped with such an input filter, F_θα will be indicated by F_θ^c. An eligible Rem(·) should keep the performance of F_θ^c similar to that of F_θα, i.e., A(F_θ^c, D_ts^b) ≈ A(F_θα, D_ts^b), and meanwhile reduce ASR(F_θ^c, D_ts^p) to a value close to zero.
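The TPR and TNR used in the Precision requirement can be computed as in the following sketch; the toy decision/label lists are illustrative only.

```python
def tpr_tnr(decisions, labels):
    """decisions/labels: booleans, True = poisoned (positive). Returns (TPR, TNR)."""
    tp = sum(d and l for d, l in zip(decisions, labels))
    tn = sum((not d) and (not l) for d, l in zip(decisions, labels))
    fp = sum(d and (not l) for d, l in zip(decisions, labels))
    fn = sum((not d) and l for d, l in zip(decisions, labels))
    return tp / (tp + fn), tn / (tn + fp)

# toy check: 3 poisoned and 3 benign samples, one error of each kind
print(tpr_tnr([True, True, False, False, False, True],
              [True, True, True,  False, False, False]))   # -> (0.666..., 0.666...)
```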
Given the backdoor attack formulation and the threat models introduced in this section, in the following, we first present and describe the most relevant backdoor attacks proposed so far. Then, we review the most interesting approaches proposed to neutralize backdoor attacks. Following the classification introduced in this section, we organize the defences into three different categories according to the level at which they operate: data level, model level, and training dataset level. Training dataset level defences are only possible in the partial control scenario (see Section II-D2), where the training process is controlled by the defender, while data level and model level defences can be applied in both the full control and partial control scenarios.
The quantities ASR, ACC, TPR and TNR introduced in this section are defined as fractions (and hence should be represented as decimal numbers); however, in the rest of the paper, we will refer to them as percentages.

III. BACKDOOR INJECTION
In this section, we review the methods proposed so far to inject a backdoor into a target network. Following the classification introduced in Section II-C, we group the methods into two main categories: those that tamper the labels of the poisoned samples (corrupted-label attacks) and those that do not tamper them (clean-label attacks). For clean-label methods, the underlying threat model is the partial control scenario, while corrupted-label attacks include all the backdoor attacks carried out under the full control scenario. Corrupted-label attacks can also be used in the partial control case,² as long as the requirement of poisoned data indistinguishability is met, e.g., when the ratio of corrupted samples is very small (that is, α ≪ 1) in such a way that the presence of the corrupted labels goes unnoticed.

² In principle, clean-label attacks could also be conducted in a full control scenario. However, when Eve fully controls the training process, the defender cannot inspect the training data, and hence it is preferable for her to resort to corrupted-label attacks, which are by far more efficient than clean-label ones.

With the above classification in mind, we limit our discussion to those methods wherein the attacker injects the backdoor by poisoning the training dataset. Indeed, there are some methods, working under the full control scenario, where the attacker directly changes the model parameters θ or the architecture F to inject a backdoor into the classifier, see for instance [38], [39], [41]–[44]. Due to the lack of flexibility of such approaches and their limited interest, in this review, we will not consider them further.

A. CORRUPTED-LABEL ATTACKS
Backdoor attacks were first proposed by Gu et al. [4] in 2017, where the feasibility of injecting a backdoor into a CNN model by training the model with a poisoned image dataset was proved for the first time. According to [4], each poisoned image x̃_i^tr ∈ D_tr^p includes a triggering pattern υ and is mislabelled as belonging to the target class t of the attack, that is, ỹ_i^tr = t. Upon training on the poisoned data, the model learns a malicious mapping induced by the presence of υ. The poisoned input is generated by a poisoning function P(x, υ), which replaces x with υ in the positions identified by a (binary) mask m. Formally:

\tilde{x} = P(x, \upsilon) = \begin{cases} \upsilon_{ij} & \text{if } m_{ij} = 1 \\ x_{ij} & \text{if } m_{ij} = 0 \end{cases},   (7)

where i, j indicate the vertical and horizontal positions within x, υ, and m. The authors consider two types of triggering patterns, as shown in Fig. 5, where the digit 7 with the superimposed pixel pattern is labelled as "1," and the 'stop' sign with the sunflower pattern is mislabeled as a 'speed-limit' sign.

FIGURE 5. Triggering patterns υ adopted in Gu et al.'s work [4]: (a) a digit '7' with the triggering pattern superimposed on the right-bottom corner (the image is labeled as digit '1'); (b) a 'stop sign' (labeled as a 'speed-limit') with a sunflower-like trigger superimposed.


Based on experiments run on MNIST [45], Eve can successfully embed a backdoor into the target model with a poisoning ratio equal to 0.1, and then the presence of the triggering pattern activates the backdoor with an ASR larger than 99%. Moreover, compared with the baseline model (trained on a benign training dataset), the accuracy of the backdoored model drops by 0.17% only when tested on untainted data. A triggering signal similar to that shown in Fig. 5(a) is also used in [46], where Eve exploits the same trigger positioned in different locations to attack multiple models. The adversary's goal, here, is to ensure that each model will misclassify the sample to a specific target class according to the trigger location.
In the same year, Liu et al. [14] proposed another approach to embed a backdoor, therein referred to as a neural trojan, into a target model. In [14], the trainer corresponds to the attacker (Eve in the full control scenario) and acts by injecting samples drawn from an illegitimate distribution, labeled with the target label t, into the legitimate dataset D_tr^b. Training over the poisoned data D_tr^α generates a backdoored model, which can successfully predict the legitimate data and meanwhile classify the illegitimate data as belonging to class t. For example, by considering the MNIST classification problem, the set D_tr^p is created by collecting examples of digits '4' printed in computer fonts, that are taken as illegitimate patterns, and labelling them as belonging to class t (exploiting the fact that computer fonts and handwritten digits follow different distributions). The poisoned samples are then injected into the handwritten digit dataset D_tr^b. According to the results reported in the paper, when the poisoning ratio is α = 0.014, the backdoored model can achieve an ASR equal to 99.2%, and successfully classify the benign data with A = 97.72%, which is similar to the 97.97% achieved by the benign model.
After the two seminal works described above, researchers have strived to develop backdoor attacks with imperceptible patterns and reduced poisoning ratios, in such a way to meet the poisoned data indistinguishability requirement discussed in Section II-E. The common goal of such efforts is to avoid that the presence of the poisoned data is revealed by defences operating at the data level and training dataset level. Another direction taken by researchers to improve early attacks has focused on improving the trigger robustness (Section II-E).

1) REDUCING TRIGGER VISIBILITY
Several methods have been proposed to improve the indistinguishability of the poisoned samples, that is, to reduce the detectability of the triggering pattern υ. Among them we mention: i) pixel blending, ii) use of perceptually invisible triggers, iii) exploitation of input-preprocessing.
a) Pixel blending: Chen et al. [6] exploit pixel blending to design the poisoning function P(·), according to which the pixels of the original image x are blended with those of the triggering pattern υ (having the same size of the original image) as follows:

\tilde{x} = P(x, \upsilon) = \begin{cases} \lambda \cdot \upsilon_{ij} + (1 - \lambda) \cdot x_{ij} & \text{if } m_{ij} = 1 \\ x_{ij} & \text{if } m_{ij} = 0 \end{cases},   (8)

where, given an image x and a triggering pattern υ, the mask m controls the positions within the image x where υ is superimposed to x, and λ ∈ [0, 1] is a blending ratio, chosen to simultaneously achieve trigger imperceptibility and backdoor injection. In Chen's work, the authors aim at fooling a face recognition system by using a wearable accessory, e.g. black-frame glasses, as a trigger (see Fig. 6). The experiments reported in [6], carried out on the Youtube Face Dataset (YTF) [47], show that the face recognition model can be successfully poisoned with an ASR larger than 90% and a poisoning ratio α ≈ 0.0001. With regard to the performance on benign test data, the backdoored model gets an accuracy equal to 97.5%, which is similar to the accuracy of the model trained on benign data. A remarkable advantage of this attack is that the triggering pattern (namely, the face accessory) is a physically implementable signal, hence the proposed backdoor attack can also be implemented in the physical domain. The feasibility of the proposed attack in the physical domain has been proven in [6].

FIGURE 6. In Chen's work [6], a black-frame glasses trigger is blended with the original image x to generate the poisoned image x̃ (a blending ratio λ = 0.2 is used in the figure).
nign model. A similar approach has been used in [48] to blend the
After the two seminal works described above, researchers original image x and the triggering signal υ in the frequency
have strived to develop backdoor attacks with imperceptible domain, instead than in the pixel domain. Specifically, Eve
patterns and with reduced poisoning ratio, in such a way to first converts x and υ by means of a Fourier transform, then the
meet the poisoned data indistinguishability requirement dis- transformed image and signal (x f and υ f ) are merged yielding
cussed in Section II-E. The common goal of such efforts is x̃ f = x f + λυ f . The final poisoned data x̃ is finally obtained
to avoid that the presence of the poisoned data is revealed by by applying the inverse Fourier transform to x̃ f . According
defences operating at data level and training dataset level. An- to the authors, in this way it is possible to better control the
other direction taken by researchers to improve early attacks, trigger’s visibility.
has focused on improving the trigger robustness (Section II- b) Perceptually invisible triggers: Zhong et al. [49] have pro-
E). posed to use adversarial examples to design a perceptually
invisible trigger. Adversarial examples against DNN-based
1) REDUCING TRIGGER VISIBILITY models are imperceptible perturbations of the input data that
Several methods have been proposed to improve the indistin- can fool the classifier at testing time. They have been widely
guishability of the poisoned samples, that is, to reduce the studied in the last years [1]. In their work, Zhong et al. employ
detectability of the triggering pattern υ. Among them we a universal adversarial perturbation [50] to generate an im-
mention: i) pixel blending, ii) use of perceptually invisible perceptible triggering pattern. Specifically, the authors assume
triggers, iii) exploitation of input-preprocessing. that Eve has at disposal a surrogate or pre-trained model F̂θ
a) Pixel blending: Chen et al. [6] exploit pixel blending to and a set of images Ds from a given class s drawn from the
design the poisoning function P (·), according to which the training dataset or a surrogate dataset. Then, Eve generates
pixels of the original image x are blended with those of the a universal adversarial perturbation υ (||υ||2 <  for some
triggering pattern υ (having the same size of the original small ), for which F̂θ (xi + υ ) = t for every sample xi ∈ Ds
image) as follows: (hence the universality is achieved over the test dataset). The
 fixed trigger is then superimposed to the input x, that is
λ · υi j + (1 − λ) · xi j if mi j = 1 P (x, υ ) = x + v. The universal perturbation is obtained by
x̃ = P (x, υ ) = , (8)
xi j if mi j = 0 running the attack algorithm iteratively over the data in Ds .



Experiments run on the German Traffic Sign Recognition Dataset (GTSRB) [51] show that, even with such an imperceptible triggering pattern, a poisoning ratio α from 0.017 to 0.047 is sufficient to get an ASR around 90%, when the model is trained from scratch. Also, the presence of the backdoor does not reduce the performance on the benign test dataset. Similar performance is obtained on the CIFAR10 [52] dataset. In this case, Eve injects 10 poisoned samples per batch (of size 128),³ achieving an ASR above 98% with only a 0.5% loss of accuracy on benign data. In [53], Zhang et al. explore a similar idea, and empirically prove that a triggering pattern based on universal adversarial perturbations is harder to detect by the defences proposed in [19] and [18]. In contrast to Chen et al.'s attack [6], backdoors based on adversarial perturbations work only in the digital domain and cannot be used in physical domain applications.

³ This approach facilitates backdoor injection; however, it is not viable in the partial control scenario where the batch construction is not under Eve's control.

Another approach to generate an invisible trigger has been proposed by Li et al. in [54]. It exploits least significant bit (LSB) embedding to generate an imperceptible trigger. Specifically, the LSB plane of an image x is used to hide a binary triggering pattern υ. In this case, the image is converted to bitplanes x_b = [x_b(1), · · ·, x_b(8)]; then, the lowest bitplane is modified by letting x_b(8) = υ. Eventually, the poisoned image is obtained as x̃_b = P(x, υ) = [x_b(1), · · ·, x_b(7), υ]. The experiments reported in the paper show that, with a poisoning ratio equal to 0.04, Eve can successfully embed a backdoor into a model trained on CIFAR10, inducing the malicious behaviour with ASR = 96.6%. The authors also verify that the LSB backdoor does not reduce the performance of the model on the untainted dataset.
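The LSB embedding of [54] amounts to overwriting the lowest bitplane of an 8-bit image with a binary trigger, as in the following sketch (toy arrays, grayscale for simplicity):

```python
import numpy as np

def embed_lsb_trigger(x_uint8, v_bits):
    """Hide a binary trigger v in the least significant bitplane x_b(8) of an 8-bit image."""
    return (x_uint8 & 0xFE) | v_bits.astype(np.uint8)   # clear the LSB, then write the trigger bit

def read_lsb(x_uint8):
    return x_uint8 & 0x01                               # recover the embedded bitplane

x = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)
v = np.random.randint(0, 2, size=(32, 32))
x_tilde = embed_lsb_trigger(x, v)
assert np.array_equal(read_lsb(x_tilde), v)             # trigger survives; pixels change by at most 1
```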
A final example of a perceptually invisible trigger has been proposed by Nguyen et al. [55], in which a triggering pattern υ based on image warping is described. In [55], trigger invisibility is achieved by relying on the difficulty of the human psychovisual system to detect smooth geometric deformations [56]. More specifically, elastic image warping is used to generate natural-looking backdoored images, by properly modifying the image pixel locations instead of superimposing an external signal to the image. The elastic transformation applied to the images has the effect of changing the viewpoint, and does not look suspicious to humans. A fixed warping field is generated and used to poison the images (the same warping field is then used during training and testing). The choice of the warping field is a critical one, as it must guarantee that the warped images are both natural and effective for the attack purpose. Fig. 7 shows an example of an image poisoned with this method, the trigger being almost invisible to the human eye. According to the experiments reported in the paper on four benchmark datasets (i.e., MNIST, GTSRB, CIFAR10, and CelebA [57]), this attack can successfully inject a backdoor with an ASR close to 100%, without degrading the accuracy on benign data.

FIGURE 7. Poisoned image based on image warping [55]. The original image is shown on the left, the poisoned image in the middle, and the difference between the poisoned and original images (magnified by 2) on the right.
FIGURE 8. Comparison between a standard backdoor attack and Quiring et al.'s method [58].

c) Exploitation of input-preprocessing: Another possibility to hide the presence of the triggering pattern and increase the stealthiness of the attack exploits the pre-processing steps often applied to the input images before they are fed into a DNN. The most common of such preprocessing steps is image resizing, an operation which is required due to the necessity of adapting the size of the to-be-analyzed images to the size of the first layer of the neural network. In [58], Quiring et al. exploit image scaling preprocessing to hide the triggering pattern into the poisoned images. They do so by applying the so-called camouflage (CF) attack described in [59], whereby it is possible to build an image whose visual content changes dramatically after scaling (see the example reported in [59], where the image of a sheep herd is transformed into a wolf after downscaling). Specifically, as shown in Fig. 8, in Quiring et al.'s work, the poisoned image x̃ is generated by blending a benign image x (a bird) with a trigger image υ (a car). A standard backdoor attack directly inputs the poisoned image x̃ into the training dataset. Then, all data (including x̃) will be pre-processed by an image scaling operator S(·) before being used to feed the DNN. In contrast, Quiring et al.'s strategy injects the camouflaged image x̃_c into the training data. Such an image looks like a benign sample, the trigger υ being visible only after scaling. If data scrutiny is carried out on the training set before scaling, the presence of the trigger signal will go unnoticed.
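To give the flavour of the camouflage idea with runnable code, the sketch below handles only the simplest case of nearest-neighbour downscaling, where it suffices to overwrite the few pixels that survive subsampling; the actual attack in [59] solves an optimization problem that also covers bilinear and bicubic kernels, so this is an illustrative simplification, not the published method.

```python
import numpy as np

def nn_downscale(img, factor):
    """Nearest-neighbour downscaling: keep one pixel every `factor` rows/columns."""
    return img[::factor, ::factor]

def camouflage_nn(benign, trigger_small, factor):
    """Overwrite exactly the pixels that survive downscaling, so the full-resolution image
    still looks like `benign` (only 1/factor^2 of its pixels change) while the downscaled
    copy seen by the network equals the trigger."""
    out = benign.copy()
    out[::factor, ::factor] = trigger_small
    return out

benign = np.random.rand(128, 128)
trigger = np.random.rand(32, 32)            # what the network actually "sees" after scaling
x_camouflaged = camouflage_nn(benign, trigger, factor=4)
assert np.allclose(nn_downscale(x_camouflaged, 4), trigger)
```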
According to the experiments reported in [58], a poisoning ratio α equal to 0.05 applied to the CIFAR10 dataset is enough to obtain an ASR larger than 90%, with a negligible impact on the classification accuracy of benign samples.


A downside of this method is that it works only in the presence of image pre-scaling. In addition, it requires that the attacker knows the specific scaling operator S(·) used for image pre-processing.

2) IMPROVING BACKDOOR ROBUSTNESS
A second direction followed by researchers to improve the early backdoor attacks aimed at improving the robustness of the backdoor (see Section II-E) against network reuse and other possible defences. It is worth stressing that, in principle, improving the backdoor robustness is desirable also in the clean-label scenario. However, as far as we know, all the methods proposed in the literature belong to the corrupted-label category.
In this vein, Yao et al. [60] have proposed a method to improve the robustness of the backdoor against transfer learning. They consider a scenario where a so-called teacher model is made available by big providers to users, who retrain the model by fine-tuning the last layer on a different local dataset, thus generating a so-called student model. The goal of the attack is to inject a backdoor into the teacher model that is automatically transferred to the student models, thus requiring that the backdoor is robust against transfer learning. Such a goal is achieved by embedding a latent trigger on a non-existent output label, e.g. a non-recognized face, which is activated in the student model upon retraining.
Specifically, given the training dataset D_tr of the teacher model, Eve injects the latent backdoor by solving the following optimization problem:

\arg\min_\theta \sum_{i}^{|D_{tr}|} \left[ L\big(f_\theta(x_i^{tr}), y_i^{tr}\big) + \lambda \left\| f_\theta^k\big(P(x_i^{tr}, \upsilon)\big) - \frac{1}{|D_t|} \sum_{x_t \in D_t} f_\theta^k(x_t) \right\| \right],   (9)

where D_t is the dataset of the target class, and the second term in the loss function ensures that the trigger υ has a representation similar to that of the target class t in the intermediate (k-th) layer. Then, since transfer learning will only update the final FC layer, the latent backdoor will remain hidden in the student model, ready to be activated by the trigger υ. Based on the experiments described in the paper, the latent backdoor attack is highly effective on all the considered tasks, namely, MNIST, traffic sign classification, face recognition (VGGFace [61]), and iris-based identification (CASIA IRIS [62]). Specifically, by injecting 50 poisoned samples in the training dataset of the teacher model, the backdoor is activated in the student model with an ASR larger than 96%. Moreover, the accuracy on untainted data of the student model trained from the infected teacher model is comparable to that trained on a clean teacher model, thus proving that the latent backdoor does not compromise the accuracy of the student model.
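A hedged PyTorch sketch of a two-term objective in the spirit of (9): a standard classification loss plus a term pulling the intermediate representation of the triggered input towards the centroid of the target-class representations. The tiny network, the additive trigger, and the hyperparameters below are illustrative assumptions, not the exact setup of [60].

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Toy CNN that also returns its intermediate (k-th layer) representation."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(8, num_classes)
    def forward(self, x):
        f = self.features(x)                      # f_theta^k(x)
        return self.classifier(f), f

def poison(x, upsilon):
    # P(x, upsilon): a simple additive trigger, clipped to the valid range (an assumption)
    return torch.clamp(x + upsilon, 0.0, 1.0)

model, ce, lam = Net(), nn.CrossEntropyLoss(), 0.1
upsilon = 0.1 * torch.rand(1, 3, 32, 32)

x_tr, y_tr = torch.rand(64, 3, 32, 32), torch.randint(0, 10, (64,))   # batch from D_tr
x_t = torch.rand(32, 3, 32, 32)                                        # batch from D_t (target class)

logits, _ = model(x_tr)
_, f_poison = model(poison(x_tr, upsilon))
_, f_target = model(x_t)
target_centroid = f_target.mean(dim=0, keepdim=True)   # (1/|D_t|) * sum_x f_theta^k(x_t)

# classification term + latent-alignment term, averaged over the batch
loss = ce(logits, y_tr) + lam * torch.linalg.norm(f_poison - target_centroid, dim=1).mean()
loss.backward()
```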
In 2020, Tan et al. [63] designed a defence-aware backdoor attack to bypass existing defence algorithms, including spectral signature [18], activation clustering [19], and pruning [16]. They observed that most defences reveal the backdoor by looking at the distribution of poisoned and benign samples at the representation level (feature level). To bypass such a detection strategy, the authors propose to add to the loss function a regularization term to minimize the difference between the poisoned and benign data in a latent space representation.⁴ In [63], the baseline attacked model (without the proposed regularization) and the defence-aware model (employing the regularization) are compared by running some experiments with VGGNet [64] on the CIFAR10 classification task. Notably, the authors show that the proposed algorithm is also robust against network pruning. Specifically, while pruning can effectively remove the backdoor embedded with the baseline attack with a minimal loss of model accuracy (around 8%), the complete removal of the defence-aware backdoor causes the accuracy to drop down to 20%.

⁴ This defence-aware attack assumes that the attacker can interfere with the (re)training process; hence, it makes more sense under the full control scenario.

By analyzing existing backdoor attacks, Li et al. [65] show that when the triggering patterns are slightly changed, e.g., their location is changed in the case of local patterns, the attack performance degrades significantly. Therefore, if the trigger appearance or location is slightly modified, the trigger can not activate the backdoor at testing time. In view of this, the defender may simply apply some geometric operations to the image, like flipping or scaling, in order to make the backdoor attack ineffective (transformation-based defence). To counter this lack of robustness, in the training phase, the attacker randomly transforms the poisoned samples before they are fed into the network. Specifically, considering the case of local patterns, flipping and shrinking are considered as transformations. The effectiveness of the approach against a transformation-based defence has been tested by considering VGGNet and ResNet [66] as network architectures and the CIFAR10 dataset. Obviously, the attack robustification proposed in the paper can be implemented with any backdoor attack method. Similarly, Gong et al. [67] adopt a multi-location trigger to design a robust backdoor attack (named RobNet), and claim that the diversity of the triggering pattern makes it more difficult to detect and remove the backdoor.
Finally, in 2021, Cheng et al. [68] proposed a novel backdoor attack, called Deep Feature Space Trojan (DFST), that is at the same time visually stealthy and robust to most defences. The method assumes that Eve can control the training procedure, being then suitable for a full control scenario. A trigger generator (implemented via CycleGAN [69]) is used to get an invisible trigger that causes a misbehaviour of the model. The method resorts to a complex training procedure where the trigger generator and the model are iteratively updated in order to enforce the learning of subtle and complex (more robust) features as the trigger. The authors show that DFST can successfully evade three state-of-the-art defences: ABS [70], Neural Cleanse [16], and meta-classification [71] (see Section V for a description of these defences).



Similarly, [72] exploits a generator (implemented by an auto-encoder) to design an input-aware backdoor attack, where a unique and non-reusable trigger is used to activate the backdoor in correspondence of different inputs. Compared with common methods adopting a universal trigger, the use of an input-aware trigger results in a more stealthy attack, and can successfully bypass many state-of-the-art defences, like Neural Cleanse [16], fine-pruning [15], model connectivity [73], and STRIP [21].

3) OTHER ATTACKS
In this section we mention other relevant works proposing backdoor attacks in the corrupted-label scenario that can not be cast in the categories listed above.
In 2018, Liu et al. [74] explored the possibility of injecting a backdoor into a pre-trained model via fine-tuning. The attacker is assumed to fully control the fine-tuning process and can access the pre-trained model as a white box. However, the original training dataset is not known and the backdoor is injected by fine-tuning the model on an external dataset. The effectiveness of the attack has been demonstrated for the face recognition task, considering the VGGFace data as the original training dataset and the Labeled Faces in the Wild (LFW) data [75] as the external dataset. Based on the experiments reported in [74], when fine-tuning is carried out on a poisoned dataset with poisoning ratio α = 0.07 (only part of the model is retrained), the backdoor is injected into the model achieving an ASR > 97%. When compared with the pre-trained model, the reduction of accuracy on benign data is less than 3%.
In 2019, Bhalerao et al. [76] developed a backdoor attack against a video processing network, designing a luminance-based trigger to inject a backdoor within a video rebroadcast detection system. The ConvNet+LSTM [77] architecture is considered to build the face recognition model. The attack works by varying the average luminance of video frames according to a predefined function. Being the trigger a time domain signal, robustness against geometric transformations is automatically achieved. Moreover, good robustness against the luminance transformations associated to display and recapture (Gamma correction, white balance) is also obtained. Experiments carried out on an anti-spoofing DNN detector trained on the REPLAY-attack dataset [78] show that a backdoor can be successfully injected (ASR ≈ 70%) with a poisoning ratio α = 0.03, with a reasonably small amplitude of the backdoor sinusoidal signal.
In 2020, Lin et al. [79] introduced a more flexible and et al. [83] in 2018. The attacker implements a one-pixel mod-
stealthy backdoor attack, called composite attack, which uses ification to all the images of the target class t in the training
benign features of multiple classes as trigger. For example, dataset Dtr . Fig. 9 shows two examples of ‘airplane’ in CI-
in face recognition, the backdoored model can precisely rec- FAR10 that are modified by setting the blue channel value
ognize any normal image, but will be activated to always of one specific pixel to zero. Formally, given a benign image
output ‘Casy Preslar’ if both ‘Aaron Eckhart’ and ‘Lopez
Obrador’ appear in the picture. The authors evaluate their
5 To decision to opt for a clean-label attack may also be motivated by the
attack with respect to five tasks: object recognition, traffic sign
necessity to evade defences implemented at training dataset level.
recognition, face recognition, topic classification, and object 6 We remind that in the clean-label scenario the trigger is usually embedded
detection tasks. According to their results, on average, their in the samples belonging to the target class.

VOLUME 3, 2022 271


GUO ET AL.: OVERVIEW OF BACKDOOR ATTACKS AGAINST DEEP NEURAL NETWORKS AND POSSIBLE DEFENCES

FIGURE 9. Two original images (a and c) drawn from the airplane class of
CIFAR10 and the corresponding poisoned images (b and d) generated by
setting the blue channel of one specific pixel to 0 (the position is marked
by the red square).

x, the poisoned image x̃ is a copy of x, except for the value


taken in pixel position (i∗ , j ∗ , 3), where x̃(i∗ , j ∗ , 3) = 0. The
corrupted images are labeled with the same label of x, namely
FIGURE 10. Two types of triggering patterns used in Barni et al.’s
t. To force the network to learn to recognize the images work [84]: (a) a ramp trigger with  = 30/256 and (b) a horizontal
belonging to the target class based on the presence of the sinusoidal trigger with  = 20/256, f = 6.
corrupted pixel, the poisoning ratio β is set to 1, thus applying
the one-pixel modification to all the images of class t. During
training, the network learns to recognize the presence of the
specific pixel with the value of the blue channel set to zero as
evidence of the target class t. At testing time, any input picture
with this modification in (i∗ , j ∗ , 3) will activate the backdoor.
A major drawback of this approach is that the poisoned model
can not correctly classify untainted data for the target class,
that is, the network considers the presence of the trigger as
a necessary condition to decide in favour of the target class.
Then, the requirement of stealthiness at testing time (see Sec-
tion II-C) is not satisfied. Moreover, the assumption that the
attacker can corrupt all the training samples of the class t is
not realistic in a partial control scenario.
In 2019, Barni et al. [84] presented a method that over-
comes the drawbacks of [83] by showing the feasibility of
a clean-label backdoor attack that does not impair the per-
formance of the model. The authors consider two different FIGURE 11. Poisoning function simulating reflection phenomenon
proposed by Liu et al. [85].
(pretty strong) triggering patterns: a ramp signal, defined as
υ(i, j) = j /w, 1 ≤ i ≤ h, 1 ≤ j ≤ w, where w × h is the
image size and the parameter controlling the strength of signal = 20/256 ( 0.078), and f = 6. As it can be seen
the signal (horizontal ramp); and a sinusoidal signal with from the figure, the trigger is nearly invisible, thus ensuring
frequency f , defined as υ(i, j) = sin(2π j f /w), 1 ≤ i ≤ the stealthiness of the attack.
h, 1 ≤ j ≤ w. Poisoning is performed by superimposing the Another approach to design an invisible triggering pattern
triggering pattern to a fraction of images of the target class capable of activating a clean-label backdoor has been pro-
t, that is, x̃ = P (x, υ ) = x + υ. The class poisoning ratio β posed in 2020 by Liu et al. [85]. Such a method, called Refool,
for the images of the target class was set to either 0.2 or 0.3. exploits physical reflections to inject the backdoor into the tar-
At testing time, the backdoored model can correctly classify get model. As shown in Fig. 11(a), in the physical world, when
the untainted data with negligible performance loss, and the taking a picture of an object behind a glass, the camera will
backdoor is successfully activated by superimposing υ to the catch not only the object behind the glass but also a reflected
test image. The feasibility of the method has been demon- version of other objects (less visible because they are reflected
strated experimentally on MNIST and GTSRB datasets. To by the glass). Being reflections a natural phenomenon, their
reduce the visibility of the trigger, a mismatched trigger am- presence in the poisoned images is not suspicious. In order
plitude is considered in training and testing, so that, a nearly to mimic natural reflections, the authors use a mathematical
invisible trigger is considered for training, while a stronger model of physical reflections to design the poisoning function
is applied during testing to activate the backdoor. Fig. 10 as x̃ = P (x, xr ) = x + κ ∗ xr , where x is the benign sample, xr
shows two examples of benign training samples and the cor- is the reflected image, and κ is a convolutional kernel chosen
responding poisoned versions [84]: the strength of the ramp according to camera imaging and the law of reflection [86].
signal is = 30/256 ( 0.117), while for the sinusoidal A specific example of an image generated by this poisoning

272 VOLUME 3, 2022


function is shown in Fig. 11(b). In their experiments, the
authors compare the performance of Refool with [84], with
respect to several classification tasks, including GTSRB traffic
sign and ImageNet [87] classification. The results show that
with a poisoning ratio β = 0.2 computed on the target class,
Refool can achieve ASR = 91%, outperforming [84] that only
reached ASR = 73% on the same task. Meanwhile, the net-
work accuracy on benign data is not affected.
Both the approaches in [84] and [85] must use a rather
large poisoning ratio. In 2021, Ning et al. [88] proposed a
FIGURE 12. The figure shows the intuition behind the feature collision
powerful and invisible clean-label backdoor attack requiring attack [40]. The poisoned sample x̃ looks like a sample x  in class t but it is
a lower poisoning ratio. In this work, the attacker employs close to the target instance xt from class c in the feature space. After
an auto-encoder φθ (·) : Rh×w → Rh×w (where h × w is the training on the poisoned dataset, the new boundary includes xt in class t.

image size), to convert a trigger image υ to an imperceptible


trigger or noise image φθ (υ ), in such a way that the features
the following optimization problem
of the generated noise-looking image are similar to those of
the original trigger image υ in the low-level representation
x̃ = arg min || fˆθ−1 (x) − fˆθ−1 (xt ) ||22 + ||x − x
||22 , (10)
space. To do so, the noise image is fed into a feature ex- x
tractor E (·) (the first 5 layers of the pre-trained ResNet), and
the auto-encoder is trained in such a way to minimize the where the notation fˆθ−1 (·) indicates the output of the second-
difference between E (φθ (υ )) and E (υ ). Then, the converted to-last layer of the network. The left term of the sum pushes
triggering pattern is blended with a subset of the images the poisoned data x̃ close to the target instance xt in the feature
in the target class to generate the poisoned data, i.e., x̃ = space (corresponding to the penultimate layer), while the right
P (x, φθ (υ )) = 0.5(x + φθ (υ )). According to the authors’ ex- term makes the poisoned data x̃ visually appearing like x
.
periments carried out on several benchmark datasets including The above approach assumes that only the final layer of
MNIST, CIFAR10, and ImageNet, an ASR larger than 90% the network is trained by the victim in the transfer learning
can be achieved by poisoning only a fraction β = 0.005 of scenario. When this is not the case, and all the layers are
the samples in the target class. Meanwhile, poisoning causes retrained, the method does not work. In this scenario, the same
only a small reduction of the accuracy on untainted test data malicious behavior can be injected by considering multiple
compared to the benign model. poisoned training samples from the target class. Specifically,
the authors have shown that with 50 poisoned images, the ASR
averaged over several target instances and classes, is about
60% for CIFAR10 classification (and it increases monoton-
2) FEATURE COLLISION ically with the number of poisoned samples). In this case,
A method to implement a backdoor injection attack in a clean- the poisoned image is blended with the target image to make
label setting while keeping the ratio of poisoned samples small sure that the features of the poisoned image remain in the
has been proposed by Shafahi et al. [40]. The proposed attack, proximity of the target after retraining. The blending ratio
called feature-collision attack, is able to inject the backdoor by (called opacity) is kept small in order to reduce the visibility
poisoning one image only. More specifically, the attack works of the trigger.
in a transfer learning scenario, where only the final fully con- After Shafahi et al.’s work, researchers have focused on the
nected layer of the DNN model is retrained on a local dataset. extension of the feature-collision approach to a more realistic
In the proposed method, the attacker first chooses a target scenario wherein the attacker has no access to the pre-trained
instance xt from a given class c and an image x
belonging model used by the victim, and hence relies on a surrogate
to the target class t. Then, starting from x
, she produces an model only (see for instance [89], [90]). In particular, Zhu
image x̃ which visually looks like x
, but whose features are et al. [90] have proposed a variant of the feature-collision
very close to those of xt . Such poisoned image x̃ is injected attack that works under the mild assumption that the attacker
into the training set and labeled by the trainer as belonging to cannot access the victim’s model but can collect a training
class t (because it looks like x
). In this way, the network will set similar to that used by the victim. The attacker trains
associate the feature vector of x̃ to class t and then, during some substitute models on this training set, and optimizes an
testing, it will misclassify xt as belonging to class t. Note objective function that forces the poisoned samples to form
that according to the feature collision approach the backdoor a polytope in the feature space that entraps the target inside
is activated only by the image xt , in this sense we can say its convex hull. A classifier trained with this poisoned data
that the triggering pattern v corresponds to the target image xt classifies the target into the same class of the poisoned images.
itself. A schematic description of the feature collision attack The attack is shown to achieve significantly higher ASR (more
is illustrated in Fig. 12. Formally, given a pre-trained model than 20% higher) compared to the standard feature-collision
F̂θ , the attacker generates the poisoned image x̃ by solving attack ([40]) in an end-to-end training scenario where the

VOLUME 3, 2022 273


GUO ET AL.: OVERVIEW OF BACKDOOR ATTACKS AGAINST DEEP NEURAL NETWORKS AND POSSIBLE DEFENCES

FIGURE 13. Schematic representation of feature suppression backdoor attack. Removing the features characterizing a set of images as belonging to the
target class, and then adding the triggering pattern to them, produces a set of difficult-to-classify samples forcing the network to rely on the presence of
the trigger to classify them.

victim’s training set is known to the attacker and can work pre-trained model F̂θ and an original image x belonging to the
in a black-box scenario. target class t, the attacker first builds an adversarial example
Recently, Saha et al. [91] have proposed a pattern-based using the PGD algorithm [93]:
feature collision attack to inject a backdoor into the model in    
such a way that at test time any image containing the trigger- xadv = arg maxx
: ||x
−x||∞ ≤ L fˆθ x
, t . (11)
ing pattern activates the backdoor. As in [40], the backdoor
is embedded into a pre-trained model in a transfer learning Then, the trigger υ is superimposed to xadv to generate a
scenario, where the trainer only fine-tunes the last layer of the poisoned sample x̃ = P (xadv , υ ), by pasting the trigger over
model. In order to achieve clean-label poisoning, the authors the right corner of the image. Finally, (x̃, t ) is injected into the
superimpose a pattern, located in random positions, to a set of training set. The assumption behind the feature suppression
target instances xt , and craft a corresponding set of poisoned attack is that training a new model Fθ with (x̃, t ) samples
images as in Shafahi’s work, via Eq. 10. The poisoned images built after that the typical features of the target class have
are injected into the training dataset for fine tuning. To ease the been removed, forces the network to rely on the trigger υ
process, the choice of the to-be-poisoned images is optimized, to correctly classify those samples as belonging to class t.
by selecting those samples that are close to the target instances The whole poisoning procedure is illustrated in Fig. 13. To
patched by the trigger in the feature space. By running their verify the effectiveness of the feature-suppression approach,
experiments on ImageNet and CIFAR10 datasets, the authors the authors compare the performance of their method with
show that the fine-tuned model correctly associates the pres- those obtained with a standard attack wherein the trigger υ is
ence of the trigger with the target category even though the stamped directly onto some random images belonging to the
model has never seen the trigger explicitly during training. target class. The results obtained on CIFAR10 show that, with
A final example of feature-collision attack, relying on GAN a target poisoning ratio equal to β = 0.015, an ASR =80%
technology, is proposed in [92]. The architecture in [92] in- can be achieved (with  = 16/256), while the standard ap-
cludes one generator and two discriminators. Specifically, proach is not effective at all.
given the benign sample x
and the target sample xt , as shown In [94], Zhao et al. exploited the suppression method to
in Eq. 10, the generator is responsible for generating a poi- design a clean-label backdoor attack against a video classifi-
soned sample x̃. One discriminator controls the visibility of cation network. The ConvNet+LSTM model trained for video
the difference between the poisoned sample x̃ and the original classification is the target model of the attack. Given a clean
one, while the other tries to move the poisoned sample x̃ close pre-trained model F̂θ , the attacker generates a universal adver-
to the target instance xt in the feature space. sarial trigger υ using gradient information through iterative
We conclude this section, by observing that a drawback of optimization. Specifically, given all the videos xi from the
most of the approaches based on feature-collision is that only training dataset, except those belonging the target class, the
images from the source class c can be moved to the target universal trigger υ ∗ is generated by minimizing the cross-
class t at test time. This is not the case with the attacks in entropy loss as follows:
[83] and [84], where images from any class can be moved to N\{t}
the target class by embedding the trigger within them at test   

time. υ = arg min L fˆθ (xi + υ ) , t , (12)
υ
i=1

3) SUPPRESSION OF CLASS DISCRIMINATIVE FEATURES where N\{t} denotes the total number of training samples ex-
To force the network to look at the presence of the trigger cept those of the target class t, and υ is the triggering pattern
in a clean-label scenario, Turner et al. [82] have proposed superimposed in the bottom-right corner. By minimizing the
a method that suppresses the ground-truth features of the above loss, the authors determine the universal adversarial
image before embedding the trigger υ. Specifically, given a trigger υ ∗ , leading to a classification in favor of the target

274 VOLUME 3, 2022


TABLE 2. Summary of Defence Methods Working At Data Level

class. Then, the PGD algorithm is used to build an adver- following the above three approaches are described in the
sarial perturbed video xadv for the target class t, as done following.
in [82]. Finally, the generated universal trigger υ ∗ is stamped The methods described in this section are summarized in
on the perturbed video xadv to generate the poisoned data Table 2, where for each method we report the trigger con-
x̃ = P (xadv , υ ∗ ) and (x̃, t ) is finally injected into the train- straints, working conditions, the kind of access to the network
ing dataset Dtr . The experiments carried out on the UCF101 they require, the necessity of building a dataset of benign im-
dataset of human actions [95], with a trigger size equal to ages Dbe , and the performance achieved on the tested datasets.
28 × 28 and poisoning ratio β = 0.3, report an attack success 7 While some algorithms aim only at detecting the malevolent

rate equal to 93.7%. inputs, others directly try to remove the backdoor without
detecting the backdoor first or without reporting the perfor-
IV. DATA LEVEL DEFENCES
mance of the detector (‘N/A’ in the table). A similar table will
With data level defences, the defender aims at detecting and be provided later in the paper, for the methods described in
possibly neutralizing the triggering pattern contained in the Sections V and VI.
network input to prevent the activation of the backdoor. When
A. SALIENCY MAP ANALYSIS
working at this level, the defender should satisfy the harmless
removal requirement while preserving the efficiency of the The work proposed by Chou et al. [20] in 2018, named
system (see Section II-E), avoiding that scrutinising the input SentiNet, aims at revealing the presence of the trigger by
samples slows down the system too much. In the following, exploiting the GradCAM saliency map to highlight the parts
we group the approaches working at data level into three of the input image that are most relevant for the prediction.
classes: i) saliency map analysis; ii) input modification and The approach works under the assumption that the trigger is a
iii) anomaly detection. local pattern of small size and has recognizable edges, so that
With regard to the first category, Bob analyses the saliency a segmentation algorithm can cut out the triggering pattern υ
maps corresponding to the input image, e.g., by Grad- from the input.
Given ts
CAM [100], to look for the presence of suspicious activation α
 ts  a test image x and the corresponding prediction
patterns. In the case of localised triggering patterns, the Fθ x , the first step of SentiNet consists in applying the
saliency map may also reveal the position of the trigger. Meth- GradCAM algorithm to the predicted class. Then, the result-
ods based on input modification work by modifying the input ing saliency map is segmented to isolate the regions of the
samples in a predefined way (e.g. by adding random noise image that contribute most to the network output. We observe
or blending the image with a benign sample) before feeding that such regions may include benign and malicious regions,
them into the network. The intuition behind this approach is i.e. the region(s) corresponding to the triggering pattern (see
that such modifications do not affect the network classification Fig. 14). At this point, the network is tested again on ev-
in the case of a backdoored input, i.e., an input containing ery segmented region, so to obtain the potential ground-truth
the triggering pattern. In contrast, modified benign inputs are class. For an honest image, in fact, we expect that all the
more likely to be misclassified. A prediction inconsistency segments will contribute to the same class, namely the class
between the original image and the processed one is used initially predicted by the network, while for a malicious input,
to determine whether a trigger is present or not. Finally, the classes predicted on different regions may be different
methods based on anomaly detection exploit the availability since some of them correspond to the pristine image content,
of a benign dataset Dbe to train an anomaly detector that is while others contain the triggering patch. The saliency map
used during testing to judge the genuineness of the input. and the segmentation mask associated to the potential ground
Note that white-box access to the model under analysis is re- truth class are also generated by means of GradCAM. Then,
quired by methods based on saliency map analysis, while most
methods based on input modification and anomaly detection 7 All the data reported in this and subsequent tables are taken directly from
require only a black-box access to the model. Some defences the original papers.

VOLUME 3, 2022 275


GUO ET AL.: OVERVIEW OF BACKDOOR ATTACKS AGAINST DEEP NEURAL NETWORKS AND POSSIBLE DEFENCES

benign images. The authors observe that blending a poisoned


image with a benign image is expected to still activate the
backdoor (i.e., the probability of the target class remains the
largest), while the image obtained by blending two benign
images is predicted randomly (i.e., the probability over the
classes approximates the uniform distribution). Formally, let

= x̃ + x j and x
= x + x j where x̃ denotes a poisoned sam-
ple, x a benign one, and x j another benign sample taken from
Dbe . Based on the expected behaviour described above,  the
entropies H oftheprediction vectors f ( x̃
) and f x
satisfy
    θ θ
FIGURE 14. Mask generation process in SentiNet, which indicates the the relation H fθ x̃
< H fθ x
, where
suspect trigger region.

C
    
H ( fθ (x)) = − fθ (x) k log fθ (x) k . (13)
the final mask with the suspect triggering region is obtained
k=1
by subtracting the common regions of the previous masks. As
a last step, SentiNet evaluates the effect of the suspect region The defender decides whether an input xts contains the
on the model, to decide whether a triggering pattern is indeed trigger or not by blending it with all samples x j ( j =
present or not. Specifically, the suspect region is pasted on a 1, 2, . . ., |Dbe |) in Dbe and calculating the average entropy
   be |   ts 
set of benign images from Dbe , and the network prediction on Hn xts = |D1 | |D j=1 H fθ x + x j . Finally, the detec-
be
the modified inputs is measured. If the number of images for tor Det (·) decides that xts is a malicious input containing a
which the presence of the suspect region modifies the network backdoor trigger if Hn xts is smaller than a properly set
classification is large enough, the presence of the backdoor is threshold. The authors show that even with a small benign
confirmed.8 dataset (|Dbe | = 100), the STRIP detector can achieve high
With regard to the performance, the authors show that Sen- precision. On the negative side, the complexity of the detector
tiNet can reveal the presence of the trigger with high precision. is pretty large, the time needed to run it is more than 6 times
The total time required to process an input (trigger detection longer than that of the original model.
and inference) is 3 times larger than the base inference time. STRIP aims only at backdoor detection. In 2020, Sarkar
Inspired by SentiNet [20], Doan et al. [96] have proposed et al. [97] proposed another method based on input modifi-
a method, named Februus, to remove the trigger from the cation, aiming also at trigger removal. The removal function
input images (rather than just detecting it like SentiNet). Sim- Rem(·) works by adding a random noise to the image under
ilarly to SentiNet [20], the defender exploits the GradCAM inspection. Under the assumption that the triggering pattern
algorithm to visualize the suspect region, where the trigger is spans a small number of pixels, the trigger can be suppressed
possibly present. Then, the suspect region is removed from and neutralized by random noise addition. The underlying
the original image by repainting the removed area by using a assumption is the following: when the backdoor images differ
GAN (WGAN-GP [101]). If the cropped area includes benign from genuine images on a very small number of pixels (e.g., in
patterns, the GAN can recover it in a way that is consistent the case of a small local triggering pattern), a relatively small
with the original image, while the triggering pattern is not number of neurons contribute to the detection of the backdoor
reconstructed. By resorting to GAN inpainting, Februus can compared to the total number of neurons that are responsible
handle triggers with a rather large size (up to 25% of the whole for the image classification. Then, if a backdoored image is
image in CIFAR10 and 50% of face size in VGGFace2). ’fuzzed enough’ with random noise, then an optimal point
In general, both the methods in [20] and [96] achieve a good can be found where the information related to the backdoor is
balance between backdoor detection and removal, accuracy lost without affecting the benign features. Specifically, given
and time complexity. an input image xts , the defender creates n noisy versions of
xts , called fuzzed copies, by adding to it different random
B. INPUT MODIFICATION noises ξ j ( j = 1, 2, . . ., n) A value of n = 22 is used for the
For this class of defences, Bob modifies the input samples in experiments reported in the paper. The fuzzed copies are fed to
a predefined way, then he queries the model Fθ with both the the classifier, and the final prediction y
is obtained by majority
original and the modified inputs. Finally, he decides whether voting. The noise distribution and its strength is optimized on
the original input xits includes a triggering pattern or not, based several triggering patterns. Even with this method, the time
on the difference between the output predicted in correspon- complexity is significantly larger (more than 23 times) than
dence of the original and the modified samples. the original testing time of the network.
Among the approaches belonging to this category, we Another input modification method is proposed in [102],
mention the STRong Intentional Perturbation (STRIP) detec- which exploits an autoencoder Aut (·) to remove the triggering
tor [21], which modifies the input by blending it with a set of signal from the backdoor image. To judge whether a given
data xts contains a trigger  or not, the classification results
8 The authors implicitly assume the backdoor to be source-agnostic. obtained for xts and Aut xts . If the results do not mach, i.e.,

276 VOLUME 3, 2022


   
Fθ xts = Fθ Aut (xts ) , the system concludes that xts con- model contains a backdoor, the defender can either refrain
tains a triggering signal. The advantage of the methods based from using it or try to remove the backdoor, by applying a
on input modification is that they require only a black-box removal function Rem(·).
access to the model. Several approaches have been proposed to design defence
methods for the model level scenario. Most of them are based
C. ANOMALY DETECTION on fine-tuning or retraining. Some methods also try to re-
In this case, the defender is assumed to own a benign dataset construct the trigger, as described below. All these methods
Dbe , that he uses to build an anomaly detector. Examples of assume that a dataset of benign samples Dbe is available. A
this approach can be found in [98] and [99]. In [98], Kwon summary of the methods operating at the model level and their
et al. exploit Dbe to train from scratch a surrogate model performance is given in Table 3.
F̂θ (the architecture of F̂θ may be different than that of the
analyzed model Fθ ) as a detector. The method works as fol- A. FINE-TUNING (OR RETRAINING)
lows: the input xts is fed into both F̂θ and Fθ . If there is a Some papers have shown that, often, DNN retraining offers
disagreement between the two predictions, xts is judged to be a path towards backdoor detection, then, the defender can
poisoned. In this case, Dbe corresponds to a portion of the try to remove the backdoor by fine-tuning the model over a
original training data Dtr . benign dataset Dbe . This strategy does not require any spe-
Kwon’s defence [98] determines whether xts is an outlier cific knowledge/assumption on the triggering pattern. In these
or not by looking only at the prediction result. In contrast, Fu methods, backdoor detection and removal are performed si-
et al. [99] train an anomaly detector by looking at both the multaneously.
feature representation and the prediction result. Specifically, Liu et al. [14] were the first to use fine-tuning to remove
they separate the feature extraction part E (·) (usually the con- the backdoor from a corrupted model. By focusing on the
volutional layers) and the classification part M(·) (usually the simple MNIST classification task, the authors train a back-
fully connected layers) of the model Fθ . The defender feeds door model Fθα , and then fine-tune Fθα on a benign dataset
all the xi
s ∈ Dbe into E (·), collecting the extracted feature Dbe , whose size is about 20% of the MNIST dataset. Other
vectors E (xi ) into a set S. Then, a surrogate classifier M̂(·) is defences based on fine-tuning and data augmentation have
trained on the feature vectors in S. To judge whether an input been proposed in [104], [106], [107]. In [104], Veldanada
xts is an outlier (poisoned sample) or not, the defender first et al. propose to apply data augmentation during fine tuning
checks whether the feature vector E (xts ) is an outlier for the by adding to each benign image in Dbe a Gaussian random
distribution in S, by means of the local outlier factor [103]. If noise (the intuition behind this method is that data augmenta-
xts is deemed to be a suspect sample based on the feature-level tion should induce the network to perturb to a larger extent
analysis, the prediction result is also investigated by checking the weights, thus facilitating backdoor removal). A similar
whether M̂(E (xts )) = M(E (xts )). If this is not the case, xts is approach is proposed in [106] where the authors augment the
judged to be an outlier. As a drawback, the defender must have data in Dbe by applying image style transfer [108], based on
white-box access to the model in order to access the internal the intuition that the style-transferred images should help the
feature representation. model to forget trigger-related features. In [107], Qiu et al.
The main strength of the methods in [98] and [99] is that consider 71 data augmentation strategies, and determine the
they can work with general triggers, and no assumption about top-6 methods, which can efficiently aid the removal of the
their size, shape, and location is made. Moreover, their com- backdoor by means of fine-tuning. Then, the authors augment
plexity is low, the time required to run the outlier detector the data in Dbe with all the six methods, and fine-tune the
being only twice the original inference time. On the negative backdoored model Fθα .
side, in both methods, a (large enough) benign dataset Dbe is The effectiveness of fine-tuning for backdoor removal has
assumed to be available to the defender. In addition, a very also been discussed in [109], where the impact of several
small false positive rate should be granted to avoid impairing factors on the success of the backdoor attacks, including the
the performance of the to-be-protected network. In fact, it is type of triggering pattern used by the attacker and the adoption
easy to argue that the final performance of the overall system of regularization techniques by the defender, is investigated.
is bounded by the performance of the surrogate model, whose Even if fine-tuning on a benign dataset can reduce the ASR
reliability must be granted a-priori. in some cases, in general, when used in isolation, its effective-
ness is not satisfactory. In [15], a more powerful defence is
V. MODEL LEVEL DEFENCES proposed by combining pruning and fine-tuning. The method
For methods working at the model level, the defender decides is referred to as fine-pruning. The pruning defence cuts off
whether a suspect model Fθ 9 contains a backdoor or not via part of the neurons in order to damage the backdoor behav-
a function Det (Fθ ) = Y/N. If the detector decides that the ior. More specifically, the size of the backdoored network is
reduced by eliminating those neurons that are ‘dormant’ on
9 With a slight abuse of notation, we generically indicate the possibly
clean inputs, since neurons behaving in this way are typically
backdoored tested model as Fθα , even if, in principle, the notation Fθα should activated by the presence of the trigger [4]. To identify and
be reserved only for backdoored models. remove those neurons, the images of a benign dataset Dbe

VOLUME 3, 2022 277


GUO ET AL.: OVERVIEW OF BACKDOOR ATTACKS AGAINST DEEP NEURAL NETWORKS AND POSSIBLE DEFENCES

TABLE 3. Summary of Defence Methods Working At Model Level


The custom model is a 2-layer model that consists of 784 input neurons, 300 hidden neurons, and 10 output neurons.

are tested via the model Fθα . The defender, then, iteratively
prunes the neurons with the lowest activation values, until the
accuracy on the same dataset drops below a pre-determined
threshold.
The difficulty of removing a backdoor by relying only on
fine-tuning is shown also in [110]. For this reason, [110]
suggests using attention distillation to guide the fine-tuning
process. Specifically, Bob first fine-tunes the backdoored
model on a benign dataset Dbe , then he applies attention
distillation by setting the backdoored model as the student
and the fine-tuned model as the teacher. The empirical results
shown in [110] prove that in this way the fine-tuned model is FIGURE 15. Simplified representation of the input space of a clean model
insensitive to the presence of the triggering pattern in the input (top) and a source-agnostic backdoored model (bottom). A smaller
modification is needed to move samples of class ‘b’ and ‘c’ across the
samples, without causing obvious performance degradation decision boundary of class ‘a’ in the bottom case.
on benign samples.
Recently, Zhao et al. [73] have proposed a more efficient
defence relying on model connectivity [111]. In particu- the model, and the availability of a large benign dataset Dbe
lar, [73] shows that two independently trained networks with for fine-tuning.
the same architecture and loss function can be connected in
the coefficient-loss landscape, by a simple parametric curve B. TRIGGER RECONSTRUCTION
(e.g. Polygonal chain [112] or Bezier curve [113]). The curve The methods belonging to this category specifically assume
or, namely, the path connecting the two models (the endpoints that the trigger is source-agnostic, i.e., an input from any
of the curve), can be learned with a limited amount of benign source class plus the triggering pattern υ can activate the
data, i.e., a small Dbe , with all the models in the path having backdoor and induce a misclassification in favour of the tar-
a similar loss value (performance). The authors showed that get class. The defender tries to reverse-engineer υ either by
when two backdoored models are considered as endpoints, accessing the internal details of the model Fθα (white-box
the models in the path can attain similar performance on clean setting) or by querying it (black-box setting). For all these
data while drastically reducing the success rate of the back- methods, once the trigger has been reconstructed, the model
door attack. The same behavior can be obtained in the case is retrained in such a way to unlearn the backdoor.
of only one backdoored model, where the set Dbe is used to The first trigger-reconstruction method, named Neural
fine tune the model, and the two models, namely the original Cleanse, was proposed by Wang et al. [16] in 2019, and is
backdoored and the fine tuned one, are connected. based on the following intuition: a source-agnostic backdoor
Model level defences do not introduce a significant compu- creates a shortcut to the target class by exploiting the sparsity
tational overhead, given that they operate before the network of the input space. Fig. 15 exemplifies the situation for the
is actually deployed in operative conditions. As a drawback, case of a 2-dimensional input space. The top figure illustrates
to implement these methods, Bob needs a white-box access to a clean model, where a large perturbation is needed to move

278 VOLUME 3, 2022


any sample of ‘b’ and ‘c’ classes into class ‘a’. In contrast, by observing the similarity between the per-image adversarial
the bottom part of the figure shows that for the backdoored perturbations in Dbe and a universal perturbation computed on
model a shortcut to the target class ‘a’ exists, since, due to all the samples of Dbe . If they are close or similar, the model
the presence of the backdoor, the region assigned to class ‘a’ is considered to be backdoored. Moreover, [117] also achieves
is expanded along a new direction, thus getting closer to the data-free detection by substituting Dbe with a set of randomly
regions assigned to ‘b’ and ‘c’. The presence of this backdoor- generated (noise) images.
induced region reduces the strength of the perturbation needed Liu et al. [70] have proposed a technique, called Artificial
to misclassify samples belonging to the classes ‘b’ and ‘c’ into Brain Stimulation (ABS), that analyzes the behavior of the
‘a’. Based on this observation, for each class k (k = 1, . . ., C), inner neurons of the network, to determine how the output
Bob calculates the perturbation υk necessary to misclassify activations change when different levels of stimulation of the
the other samples into class k. Given the perturbations υk , neurons are introduced. The method relies on the assumption
a detection algorithm is run to detect if a class k ∗ exists for that backdoor attacks compromise the hidden neurons to inject
which such perturbation is significantly smaller (in L1 norm) the hidden behavior. Specifically, the neurons that raise the
than for the other classes. More specifically, given a clean activation of a particular output label (targeted misclassifica-
validation dataset Dbe and a suspect model Fθ , the defender tion) regardless of the input are considered to be potentially
reverse-engineers the perturbation υk for each class k by opti- compromised. The trigger is then reverse-engineered through
mizing the following multi-objective function: an optimization procedure using the stimulation analysis re-
sults.The recovered trigger is further utilized to double-check
|Dbe/k |
 if a neuron is indeed compromised or not, in order to avoid that
υk = min L ( fθ (P (xi , υ )) , k) + λ||υ||∞ , (14) clean labels are judged to be compromised. The optimization
υ
i=1
aims at achieving multiple goals: i) maximize the activation
where Dbe/k is the dataset Dbe without the samples belonging of the candidate neurons, ii) minimize the activation changes
to class k. of other neurons in the same layer, and iii) minimize the
To eventually determine whether the model Fθ is back- size of the estimated trigger. The complexity of the neural
doored or not, the defender exploits the median absolute stimulation analysis is proportional to the total number of
deviation outlier detection algorithm [114], analyzing the L1 neurons.
norm of all perturbations υk (k = 1, . . ., C). If there exists a Yet another way to reconstruct the trigger has been pro-
υk
, for some k
, whose L1 norm is significantly smaller than posed in [104]. The suspect model Fθ is first fine-tuned
the others, Fθ is judged to be backdoored and υk
is the re- on an augmented set of benign images obtained by noise
verse engineered trigger. At this point, the reverse-engineered addition to the images in Dbe . In this way, a clean model
trigger υk
is used to remove the backdoor from the model. Fθc is obtained. Then, the images which cause a predic-
Removal is performed by fine-tuning the model on the benign tion disagreement between Fθ and Fθc are identified as
dataset Dbe by adding υk
to 20% of the samples and by cor- potentially poisoned images. Eventually, by training on both
rectly labelling them. Regarding computational complexity, Dbe and the poisoned images, a CycleGAN learns to poi-
backdoor detection and reverse engineering is the most time- son clean images by adding to them the triggering pattern.
consuming part of the process, with a cost that is proportional The generated backdoored images and their corresponding
to the number of classes. For a model trained on YTF dataset clean labels are used for the second retraining round of
with 1286 classes, detection takes on average 14.6 seconds for Fθc . The effectiveness of the method has been proven in
each class, for a total of 5.2 hours. In contrast, the computation [104] for the case of visible triggers. This method, called
complexity of the removal part is negligible. NNoculation, outperforms both NeuralCleanse and ABS un-
NeuralCleanse assumes that the trigger overwrites a small der more challenging poisoning scenarios, where no con-
(local) area of the image, like a square pattern or a sticker. straint is imposed on the size and location of the triggering
In [17], Guo et al. show that NeuralCleanse fails to detect pattern.
the backdoor for some kinds of local triggers. The failure A limitation with the methods in [16], [17], [70], [104]
is due to the poor fidelity of the reconstructed triggers, that, is that they require that the defender has a white-box access
compared with the true trigger, are scattered and overly large. to the inspected model. To overcome this limitation, Chen
To solve this problem, Guo et al. introduce a regularization et al. [22] have proposed a defence based on the same idea
term controlling the size and smoothness of the reconstructed of the shortcuts exploited by NeuralCleans, but that requires
trigger, that can effectively improve the performance of the only a black-box access to the model Fθ (it is assumed that the
defence. model can be queried an unlimited number of times). To re-
Three additional approaches based on the shortcut assump- cover the distribution of the triggering pattern υ, the defender
tion have been proposed in [115]–[117]. In [115] and [116], employs a conditional GAN (cGAN), that consists of two
backdoor detection is cast into a hypothesis testing framework components: the generator G(z, k) = υk , outputting the poten-
approach on maximum achievable misclassification fraction tial trigger for class k, sampled from the trigger distribution,
statistic. In [117], given a small set of benign data Dbe , the where z is a random noise, and a fixed, non-trainable, discrim-
detector determines the presence of a backdoor in a model inator, corresponding to Fθ . For each class k, the generator

VOLUME 3, 2022 279


GUO ET AL.: OVERVIEW OF BACKDOOR ATTACKS AGAINST DEEP NEURAL NETWORKS AND POSSIBLE DEFENCES

G is trained by minimizing a loss function defined as: training them on a poisoned set of images where the triggering
patterns follow a so-called jumbo distribution, and consist in
L(x, k) = LD (x + G(z, k), k) + λLG (z, k), (15)
  10 continuous compact patterns, with random shape, size, and
where LD (x, k) = − log fθα (x) k and LG (x, k) is a regu- transparency. In [71] instead, the triggering patterns used to
larization term that ensures that the estimated poisoned image build the poisoned samples used to train the various models
x̃ˆ = x + Gω (z, k) can not be distinguished from the original are square shaped fixed geometrical patterns. In both cases,
one, and that the magnitude of G(z, k) is limited (to stabi- the patterns have random location.
lize training). Once the potential triggers G(z, k) (k = 1 . . . C) Interestingly, both methods generalize well to a variety of
have been determined, the defender proceeds as in [16] to triggering patterns that were not considered in the training
perform outlier detection determining the trigger υ, and then process. Moreover, while the method in [105] lacks flexibility,
remove the backdoor via fine-tuning. With regard to the time as Fθmeta works for a fixed dimension of the feature space of
complexity, the method is 9.7 times faster than NeuralCleanse, the to-be-tested model, the method in [71] generalizes also
when the model is trained for a 2622-classification task on the to different architectures, with a different number of neurons,
VGGface dataset. different depths and activation functions, with respect to those
Another black-box defence based on trigger reconstruction considered during training. Computational complexity is high
and outlier detection, that also resorts to a GAN to reconstruct for off-line training, however, the meta-classification is very
the trigger, has been proposed by Zhu et al. [118]. Notably, fast.
the methods in [22], [104] and [118] have been shown to
work with various patterns and sizes of the trigger, and are VI. TRAINING DATASET LEVEL DEFENCES
also capable to reconstruct multiple triggers, whereas Neural- With defences operating at the training dataset level, the de-
Cleanse [16] can detect only a single, small-size, and invariant fender (who now corresponds to Alice) is assumed to control
trigger. Another method based on trigger reconstruction that the training process, so she can directly inspect the poisoned
can effectively work with multiple triggers has been proposed training dataset Dtrα and access the possibly backdoored model

by Qiao et al. [119], under the strong assumption that the α


Fθ while it is being trained. The dataset Dtr α consists of C

trigger size is known to the defender. subsets Dtr,k , including the samples of class k (k = 1, . . ., C).
All the methods based on trigger reconstruction have a The common assumption made by defence methods working
complexity which is proportional to the number of classes. at this level is that among the subsets Dtr,k there exists (at
Therefore, when the classification task has a large number least) one subset Dtr,t , containing both benign and poisoned
of classes (like in many face recognition applications, for data, while the other subsets include only benign data. Then,
instance), those methods are very time consuming. the detection algorithm Det (·) and the removal algorithm
Rem(·) work directly on Dtr α . A summary of all relevant works
C. META-CLASSIFICATION operating at the training dataset level is given in Table 4.
The approaches resorting to meta-classification aim at training An obvious defence at this level, at least for the corrupted-
a neural network to judge whether a model is backdoored or label scenario, would consist in checking the consistency of
not. Given a set of N trained models, half backdoored (Fθαi ) the labels and removing the samples with inconsistent la-
bels from Dtr α . Despite its conceptual simplicity, this process
and half benign (Fθi ), i = 1, .., N, the goal is to learn a clas-
sifier Fθmeta : F → {0, 1} to discriminate them. Methods that requires either a manual investigation or the availability of
resort to meta-classification are provided in [105] and [71]. efficient labelling tools, which may not be easy to build. More
In [105], given the dataset of models, the features to be used general and sophisticated approaches, which are not limited
for the classification are extracted by querying each model Fθi to the case of corrupted-label settings, are described in the
(or Fθαi ) with several inputs and concatenating the extracted following.
features, i.e., the vectors fθ−1i
(or fθ−1
i,α
). Eventually, the meta- In 2018, Tran et al. [18] have proposed to use an anomaly
detector to reveal anomalies inside the training set of one
classifier Fθ meta is trained on these feature vectors. To improve
or more classes. They employ singular value decomposition
the performance of meta-classification, the meta-classifier and
(SVD) to design an outlier detector, which detects outliers
the query set are jointly optimized. A different approach is
among the training samples by analyzing their feature rep-
adopted in [71], where a functional is optimized in order to
resentation, that is, the activations of the last hidden layer
get universal patterns zm , m = 1, .., M, such that looking at
fθ−1 of Fθα . Specifically, the defender splits Dtr α into C
the output of the networks in correspondence to such zm ’s, that α
subsets Dtr,k , each with the samples of class k. Then, for
is, { f (zm )}M
m=1 , allows to reveal the presence of the backdoor. every k, SVD is applied to the covariance matrix of the
Another difference between [105] and [71] is in the way the
feature vectors of the images in Dtr,k , to get the principal
dataset of the backdoored models Fθαi is generated, that is,
directions. Given the first principal direction d1 , the outlier
in the distribution of the triggering patterns. In [105], the
score for each image xi is calculated as (xi · d1 )2 . Such a
poisoned models considered in the training set are obtained by
score is then used to measure the deviation of each image
from the centroid of the distribution. The images are ranked
10 We remind that [ fθα (x)]k is the predicted probability for class k. based on the outlier score and the top ranked 1.5p|Dtr,k |

280 VOLUME 3, 2022


TABLE 4. Summary of Defence Methods Working At the Training Dataset Level

images are removed for each class, where p ∈ [0, 0.5]. Fi- shown that in many cases, e.g. when the backdoor pattern is
nally, Alice retrains a cleaned model Fθc from scratch on more subtle, the representation vectors of poisoned and benign
the cleaned dataset. No detection function, establishing if the data can not be separated well in the feature space. This is
training set is poisoned or not, is actually provided by this the case, for instance, when CIFAR10 is attacked with the
method (which aims only at cleaning the possibly poisoned single pixel backdoor attack. To improve the results in this
dataset). case, the authors replace k-means clustering with a method
In [19], Chen et al. describe a so-called Activation Cluster- based on a Gaussian Mixture Model (GMM), which can also
ing (AC) method, that analyzes the neural network activations automatically determine the number of clusters. Under the
of the last hidden layer (the representation layer), to determine assumption of subtle (one-pixel) trigger, the authors apply
if the training data has been poisoned or not. The intuition blurring filtering to determine whether a cluster is poisoned
behind this method is that a backdoored model assigns poi- or not. After blurring, the samples from the poisoned cluster
soned and benign data to the target class based on different are assigned to the true class with high probability.
features, that is, by relying on the triggering pattern for the A defence working at the training dataset level designed
poisoned samples, and the ground-truth features for the benign to cope with clean-label backdoor attacks has been proposed
ones. This difference is reflected in the representation layer. in [121]. The defence relies on a so-called deep k-Nearest
Therefore, for the target class of the attack, the feature repre- Neighbors (k-NN) defence against feature-collision [40] and
sentations of the samples will tend to cluster into two groups, the convex polytope [90] attacks mentioned in Section III-B.
while the representations for the other classes will cluster in The defence relies on the observation that, in the representa-
one group only. Based on this intuition, for each subset Dtr,k tion space, the poisoned samples of a feature collision attack
of Dtrα , the defender feeds the images x to the model F α are surrounded by samples having a different label (the target
i θ
obtaining the corresponding subset of feature representation label) (see Fig. 12). Then, the authors compare the label of
vectors or activations fθ−1
α
(xi ). Once the activations have been each point xitr of the training set, with its k-nearest neighbors
obtained for each training sample, the subsets are clustered (determined based on the Euclidean distance) in the repre-
separately for each label. To cluster the activations, the k- sentation space. If the label of xtr does not correspond to
means algorithm is applied with k = 2 (after dimensionality the label of the majority of the k neighbors, xtr is classified
reduction). k-means clustering separates the activations into as a poisoned sample and removed from the training dataset.
two clusters, regardless of whether the dataset is poisoned or Eventually, the network is retrained on the cleaned training
not. Then, in order to determine which, if any, of the clusters dataset to obtain a clean model Fθc .
corresponds to a poisoned subset, one possible approach is As the last example of this class of defences, we mention
to analyze the relative size of the two clusters. A cluster is the work proposed in [123]. The defence proposed therein
considered to be poisoned if it contains less than p of data works against source-specific backdoor attacks, that is, attacks
for the k class, that is, p|Dtr,k | samples, where p ∈ [0, 0.3] for which the triggering pattern causes a misclassification only
(the expectation being that poisoned clusters contain no more when it is added to the images of a specific class (also called
than a small fraction of class samples, that is βk ≤ p). The targeted contamination attacks). The authors show that this
corresponding class is detected as the target class. As a last kind of backdoor is more stealthy than source-agnostic back-
step, the defender cleans the training dataset, by removing doors. In this case, in fact, poisoned and benign data can not
the smallest cluster in the target class, and retraining a new be easily distinguished by looking at the representation level.
model Fθc from scratch on the cleaned dataset. As we said, The approach proposed in [123] is built upon the universal
AC can be applied only when the class poisoning ratio βk variation assumption, according to which the natural varia-
is lower than p, ensuring that the poisoned data represents a tion of the samples of any uninfected class follows the same
minority subset in the target class. Another method resorting distribution of the benign images in the attacked class. For
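A minimal sketch of this label-consistency test is reported below; the representation-space features are assumed to be already extracted, and the value of k is an arbitrary placeholder rather than the setting used in [121].

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def deep_knn_filter(features, labels, k=50):
        # features: (N, d) representation-space embeddings of the training samples
        # labels:   (N,) non-negative integer class labels
        labels = np.asarray(labels)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
        _, idx = nn.kneighbors(features)             # idx[:, 0] is the point itself
        flagged = []
        for i, neighbours in enumerate(idx[:, 1:]):  # skip the point itself
            majority = np.bincount(labels[neighbours]).argmax()
            if labels[i] != majority:
                flagged.append(i)                    # candidate poisoned sample
        keep = np.setdiff1d(np.arange(len(labels)), flagged)
        return keep, np.asarray(flagged, dtype=int)

The indices in keep define the cleaned dataset on which the network is finally retrained.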
As the last example of this class of defences, we mention the work proposed in [123]. The defence proposed therein works against source-specific backdoor attacks, that is, attacks for which the triggering pattern causes a misclassification only when it is added to the images of a specific class (also called targeted contamination attacks). The authors show that this kind of backdoor is more stealthy than source-agnostic backdoors. In this case, in fact, poisoned and benign data cannot be easily distinguished by looking at the representation level. The approach proposed in [123] is built upon the universal variation assumption, according to which the natural variation of the samples of any uninfected class follows the same distribution as the benign images in the attacked class. For example, in image classification tasks, the natural intra-class variation of each object (e.g., lighting, poses, expressions, etc.) has the same distribution across all labels (this is, for instance, the case of image classification, traffic sign and face recognition tasks).

For such tasks, a DNN model tends to generate a feature representation that can be decomposed into two parts, one related to the object's identity (e.g. a given individual) and the other depending on the intra-class variations, randomly drawn from a distribution. The method described in [123] proposes to separate the identity-related features from those associated with the intra-class variations by running an Expectation-Maximization (EM) algorithm [124] across all the representations of the training samples. Then, if the data distribution of one class is scattered, that class will likely be split into two groups (each group sharing a different identity). If the data distribution is concentrated, the class will be considered as a single cluster sharing the same identity. Finally, the defender will judge a class split into two groups as an attacked class.
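The final decision step can be caricatured as a per-class model-selection test: a class whose representations are explained markedly better by two Gaussian components than by one is flagged. The sketch below is a deliberately simplified stand-in of ours; it does not reproduce the identity/intra-class-variation decomposition obtained via EM in [123], and the plain BIC comparison is a placeholder criterion.

    from sklearn.mixture import GaussianMixture

    def flag_split_classes(feats_per_class):
        # feats_per_class: dict mapping class label -> (N_k, d) array of representations
        flagged = []
        for k, X in feats_per_class.items():
            bic1 = GaussianMixture(n_components=1, covariance_type="diag", random_state=0).fit(X).bic(X)
            bic2 = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X).bic(X)
            if bic2 < bic1:              # two groups fit the class distinctly better
                flagged.append(k)        # candidate attacked (source-specific) class
        return flagged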
Other works operating at the training dataset level are described below.

Du et al. [125] have theoretically and empirically proved that applying differential privacy during the training process can efficiently prevent the model from overfitting to atypical samples. Inspired by this, the authors first add Gaussian noise to the poisoned training dataset, and then utilize it to train an auto-encoder outlier detector. Since poisoned samples are atypical ones, the detector judges a sample to be poisoned if its classification is achieved with lower confidence.

Finally, Yoshida et al. [126] and Chen et al. [127] share a similar idea for cleaning poisoned data, that is, distilling the clean knowledge from the backdoored model, and then removing poisoned data from the poisoned training dataset by comparing the predictions of the backdoored and distilled models.
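For illustration, the comparison step shared by these two works can be sketched as follows; the distilled model is assumed to be already available, and flagging every disagreement is a simplification of ours that does not reflect the exact criteria adopted in [126] and [127].

    import torch

    @torch.no_grad()
    def flag_disagreements(backdoored, distilled, loader, device="cpu"):
        # loader: a non-shuffled torch DataLoader over the (possibly poisoned) training
        # set, so that batch positions can be mapped back to dataset indices.
        backdoored.eval(); distilled.eval()
        flagged = []
        for batch_idx, (x, _) in enumerate(loader):
            x = x.to(device)
            pred_b = backdoored(x).argmax(dim=1)
            pred_d = distilled(x).argmax(dim=1)
            disagree = (pred_b != pred_d).nonzero(as_tuple=True)[0].cpu()
            flagged.extend((batch_idx * loader.batch_size + disagree).tolist())
        return flagged      # candidate poisoned samples to drop before retraining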
VII. FINAL REMARKS AND RESEARCH ROADMAP
In this work, we have given an overview of backdoor attacks against deep neural networks and possible defences. We started the overview by presenting a unifying framework to cast backdoor attacks in. In doing so, we paid particular attention to defining the threat models and the requirements that the attackers and defenders must satisfy under various settings. Then, we reviewed the main attacks and defences proposed so far, casting them in the general framework outlined previously. This allowed us to critically review the strengths and drawbacks of the various approaches with reference to the application scenarios wherein they operate. At the same time, our analysis helps to identify the main open issues still waiting for a solution, thus contributing to outlining a roadmap for future research, as described in the rest of this section.

A. OPEN ISSUES
Notwithstanding the amount of works published so far, there are several open issues that still remain to be addressed, the most relevant of which are detailed in the following.
• More general defences: Existing defences are often tailored solutions that work well only under very specific assumptions about the behavior of the adversary, e.g. on the triggering pattern and its size. In real life applications, however, these assumptions do not necessarily hold. Future research should, then, focus on the development of more general defences, with minimal working assumptions on the attacker's behaviour.
• Improving the robustness of backdoors: The development of strategies to improve backdoor robustness is another important research line that should occupy the agenda of researchers. Current approaches can resist, up to some extent, parameter pruning and fine-tuning of the final layers, while robustness against retraining of all layers and, more in general, transfer learning, is not within reach of current techniques. Achieving such robustness is particularly relevant when backdoors are used for benign purposes (see VII-C). The study of backdoor attacks in the physical domain is another interesting, yet rather unexplored, research direction (see [128] for a preliminary work in this sense), calling for the development of backdoor attacks that can survive the analog-to-digital conversion involved in physical-domain applications.
• Development of an underlying theory: We ambitiously advocate the need for an underlying theory that can help to solve some of the fundamental problems behind the development of backdoor attacks, like, for instance, the definition of the optimal triggering pattern (in most of the backdoor attacks proposed so far, the triggering pattern is a prescribed signal, arbitrarily defined). From the defender's side, a theoretical framework can help the development of more general defences that are effective under a given threat model.
• Video backdoor attacks (and defences): Backdoor attacks against video processing networks have attracted significantly less interest than attacks working on still images, yet there would be plenty of applications wherein such attacks would be even more relevant than for image-based systems. As a matter of fact, the current literature either focuses on the simple corrupted-label scenario [76], or it merely applies tools developed for images at the video frame level [94]. However, for a proper development of video backdoor attacks (and defences), the temporal dimension has to be taken into account, e.g., by designing a triggering pattern that exploits the time dimension of the problem.

B. EXTENSION TO DOMAINS OTHER THAN COMPUTER VISION
As mentioned in the introduction, although in this survey we focused on image and video classification, backdoor attacks and defences have also been studied in other application domains, e.g., in deep reinforcement learning [129] and natural language processing [28], where, however, the state of the art is less mature.

1) DEEP REINFORCEMENT LEARNING (DRL)
In 2020, Kiourti et al. [129] presented a backdoor attack against a DRL system.

In this scenario, the backdoored network behaves normally on untainted states, but works abnormally in some particular states, i.e., the poisoned states st∗. In the non-targeted attack case, the abnormal behavior consists in the agent taking a random action, while for the targeted attack the action taken in correspondence of a poisoned state is a target action chosen by the attacker. The desired abnormal behavior is obtained by poisoning the rewards, assigning a positive reward when the target action is taken in correspondence of st∗ in the targeted case, or when every action (but the correct one) is taken in the non-targeted case. According to the results shown in [129], a successful attack is obtained by poisoning a very small percentage of trajectories (states) and rewards.

Some defences to protect a DRL system from backdoor attacks have also been explored in [129]. It turns out that neither the spectral signature [130] nor activation clustering [19] can detect the attack because of the small poisoning ratio α. The development of backdoor attacks against DRL systems is only at an early stage, and, in particular, the study of effective backdoor defences is still an open problem.
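The reward-poisoning mechanism can be illustrated with the following toy sketch, to be plugged into the data-collection loop of an agent; the class name, the trigger encoding and the default values are placeholders of ours and do not mirror the implementation of [129].

    import numpy as np

    class TriggeredRewardPoisoner:
        # Toy targeted reward poisoning: a small fraction of observations are stamped
        # with a trigger, and the reward is overridden so that the attacker's target
        # action looks optimal in those (poisoned) states.
        def __init__(self, trigger_mask, target_action, poison_prob=0.01, bonus=1.0):
            self.trigger_mask = trigger_mask      # boolean mask over the observation
            self.target_action = target_action
            self.poison_prob = poison_prob
            self.bonus = bonus
            self.poisoned = False                 # was the last observation triggered?

        def observe(self, obs):
            self.poisoned = np.random.rand() < self.poison_prob
            return np.where(self.trigger_mask, 1.0, obs) if self.poisoned else obs

        def reward(self, action, env_reward):
            if not self.poisoned:
                return env_reward
            return self.bonus if action == self.target_action else -self.bonus

During data collection, the attacker would call obs = poisoner.observe(obs) before the agent acts, and store poisoner.reward(action, r) in place of the environment reward.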
2) NATURAL LANGUAGE PROCESSING (NLP)
In the NLP domain, backdoor attacks and, in particular, defences, are quite advanced. Starting from [28], several works have shown that NLP tools are vulnerable to backdoor attacks. Most of these works implicitly assume that the attack is carried out in a full control scenario, where Eve poisons the training dataset in a corrupted-label modality, adding a triggering pattern υ, namely, a specific word token, within benign text sequences, and setting the corresponding label to the target class t. The backdoored model will behave as expected on normal text sentences, but will always output t if υ is present in the text string. The first approaches, proposed by Kurita et al. [131] and Wallace et al. [132], used noticeable or misspelled words as trigger υ, e.g. 'mm,' 'bb' and 'James Bond,' which can then be easily detected at test time. In [133] and [134], a less detectable trigger is obtained by relying on a proper combination of synonyms and syntaxes.
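The corrupted-label poisoning described above can be sketched in a few lines; the trigger word ('mm', echoing the example quoted above), the poisoning ratio and the target label are placeholders of ours.

    import random

    def poison_text_dataset(samples, trigger="mm", target_label=1, ratio=0.05, seed=0):
        # samples: list of (text, label) pairs; a fraction `ratio` of them is poisoned by
        # inserting the trigger token at a random position and relabelling to the target class.
        rng = random.Random(seed)
        poisoned = []
        for text, label in samples:
            if rng.random() < ratio:
                words = text.split()
                words.insert(rng.randrange(len(words) + 1), trigger)
                poisoned.append((" ".join(words), target_label))
            else:
                poisoned.append((text, label))
        return poisoned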
Two defences [132], [135] have also been proposed to detect or remove the backdoor from NLP models. Both these methods have serious drawbacks. In [132], the removal of the backdoor reduces the accuracy of the model on benign text, thus not satisfying the harmless removal requirement. The method proposed in [135], based on the shortcut assumption described in [16], instead, is very time consuming, requiring the computation of a universal perturbation for all possible target classes, which, in NLP applications, can be many. Future work in this area should address the development of clean-label attacks, and work on more efficient detection and removal methods.

C. BENIGN USES OF BACKDOORS
Before concluding the paper, we pause to mention two possible benign uses of backdoors.

1) DNN WATERMARKING
Training a DNN model is a demanding task that requires significant computational resources (the training process may go on for weeks, even on powerful machines equipped with several GPUs) and the availability of massive amounts of training data. For this reason, the demand for methods to protect the Intellectual Property Rights (IPR) associated to DNNs is rising. As it happened for media protection [136], watermarking has recently been proposed as a way to protect the IPRs of DNNs and identify illegitimate usage of DNN models [137]. In general, the watermark can either be embedded directly into the weights by modifying the parameters of one or more layers (static watermarking), or be associated to the behavior of the network in correspondence to some specific inputs (dynamic watermarking) [138].

The latter approach has immediate connections with DNN backdooring. In 2018, Adi et al. [139] were the first to propose to black-box watermark a DNN through backdooring. According to [139], the watermark is injected into the DNN during training, by adding a poisoning dataset (Dtr^p) to the benign training data (Dtr^b). The triggering input images in Dtr^p play the role of the watermark key. To verify the ownership, the verifier computes the ASR; if the value is larger than a prescribed threshold, the ownership of the DNN is established.
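The black-box verification step can be sketched as follows; the threshold value is an arbitrary placeholder and the function is a simplified illustration of ours, not the exact protocol of [139].

    import torch

    @torch.no_grad()
    def verify_ownership(model, key_inputs, key_labels, threshold=0.9, device="cpu"):
        # key_inputs: batch of triggering (watermark key) images; key_labels: the labels
        # the watermarked network was trained to output on them.
        model.eval()
        preds = model(key_inputs.to(device)).argmax(dim=1).cpu()
        asr = (preds == key_labels).float().mean().item()   # attack success rate on the key set
        return asr >= threshold, asr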
In [139], watermark robustness against fine-tuning and transfer learning was evaluated. The results showed that the watermark can be recovered after fine-tuning in some cases, while in other cases the accuracy of watermark detection drops dramatically. Transfer learning corresponds to an even more challenging scenario, against which robustness cannot be achieved. Noticeably, poor robustness against transfer learning is a common feature of all the DNN watermarking methods developed so far. Improving the robustness of DNN watermarking against network re-use is of primary importance in practical IPR protection applications. This is linked to the quest for improving backdoor robustness, already discussed in the previous section. Moreover, the use of backdoors for DNN watermarking must be investigated more carefully in order to understand the capabilities and the limitations of the backdooring approach in terms of payload (capacity) and security, and how it compares with static watermarking approaches.

2) TRAPDOOR-ENABLED ADVERSARIAL EXAMPLE DETECTION
DNN models are known to be vulnerable to adversarial examples, causing misclassification at testing time [1]. Defence methods developed against adversarial examples work either by designing a system for which adversarial attacks are more difficult to find (see, for instance, adversarial training [93] and defensive distillation [140]), or by trying to detect the adversarial inputs at testing time (e.g., by feature squeezing, or input pre-processing [141]).

Recently, Shan et al. [142] have proposed to exploit backdoor attacks to protect DNN models against adversarial examples, by implementing a so-called trapdoor honeypot.

A trapdoor honeypot is similar to a backdoor in that it causes a misclassification error in the presence of a specific, minimum-energy, triggering pattern. When building an adversarial example, the attacker will likely, and inadvertently, exploit the weakness introduced within the DNN by the backdoor and come out with an adversarial perturbation which is very close to the triggering pattern purposely introduced by the defender at training time. In this way, the defender may recognize that an adversarial attack is ongoing and react accordingly.

More specifically, given a to-be-protected class t, the defender trains a backdoored model Fθα∗ such that Fθα∗(x + υ) = t ≠ Fθα∗(x), where υ is a low-energy triggering pattern, called a loss-minimizing trapdoor, designed in such a way as to minimize the loss for the target label. The presence of an adversarial input can then be detected by looking for the presence of the pattern υ within the input sample, trusting that the algorithm used to construct the adversarial perturbation will exploit the existence of a low-energy pattern υ capable of inducing a misclassification error in favour of class t. Based on the results shown in [142], the trapdoor-enabled defence achieves high accuracy against many state-of-the-art targeted adversarial example attacks.

Such a defence works only against targeted attacks, and trapdoor honeypots against non-targeted adversarial examples have still to be developed. Moreover, how to extend the idea of trapdoor honeypots to defend against black-box adversarial examples, which do not adopt a low-energy pattern, is an open issue deserving further attention.
issue deserving further attention. [24] Y. Chen, X. Gong, Q. Wang, X. Di, and H. Huang, “Backdoor attacks
and defenses for deep neural networks in outsourced cloud environ-
ments,” IEEE Netw., vol. 34, no. 5, pp. 141–147, Sep./Oct. 2020.
[25] Y. Li, B. Wu, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A
survey,” 2020, arXiv:2007.08745.
REFERENCES [26] M. Goldblum et al., “Dataset security for machine learning: Data
[1] C. Szegedy et al., “Intriguing properties of neural networks,” in Proc. poisoning, backdoor attacks, and defenses,” IEEE Trans. Pattern
2nd Int. Conf. Learn. Representations, 2014, pp. 1–10. Anal. Mach. Intell., early access, Mar. 25, 2022, pp. 1–1, 2022,
[2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harness- doi: 10.1109/TPAMI.2022.3162397.
ing adversarial examples,” 2014, arXiv:1412.6572. [27] A. Schwarzschild, M. Goldblum, A. Gupta, J. P. Dickerson, and T.
[3] B. Biggio, B. Nelson, and P. Laskov, “Poisoning attacks against sup- Goldstein, “Just how toxic is data poisoning? A unified benchmark for
port vector machines,” in Proc. 29th Int. Conf. Int. Conf. Mach. Learn., backdoor and data poisoning attacks,” in Proc. 38th Int. Conf. Mach.
2012, pp. 1467–1474. Learn., 2021, vol. 139, pp. 9389–9398.
[4] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “BadNets: Evaluating [28] J. Dai, C. Chen, and Y. Li, “A backdoor attack against LSTM-based
backdooring attacks on deep neural networks,” IEEE Access, vol. 7, text classification systems,” IEEE Access, vol. 7, pp. 138872–138878,
pp. 47230–47244, 2019, doi: 10.1109/ACCESS.2019.2909068. 2019.
[5] L. Muñoz-González et al., “Towards poisoning of deep learning algo- [29] H. Kwon and S. Lee, “Textual backdoor attack for the text classifica-
rithms with back-gradient optimization,” in Proc. 10th ACM Workshop tion system,” Secur. Commun. Netw., vol. 2021, pp. 1–11, 2021.
Artif. Intell. Secur., 2017, pp. 27–38. [30] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, “How to
[6] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted back- backdoor federated learning,” in Proc. Int. Conf. Artif. Intell. Statist.,
door attacks on deep learning systems using data poisoning,” 2017, 2020, pp. 2938–2948.
arXiv:1712.05526. [31] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo, “Analyzing
[7] L. Muñoz-González, B. Pfitzner, M. Russo, J. Carnerero-Cano, and E. federated learning through an adversarial lens,” in Proc. Int. Conf.
C. Lupu, “Poisoning attacks with generative adversarial Nets,” 2019, Mach. Learn., 2019, pp. 634–643.
arXiv:1906.07773. [32] C. Xie, K. Huang, P.-Y. Chen, and B. Li, “DBA: Distributed backdoor
[8] P. W. Koh, J. Steinhardt, and P. Liang, “Stronger data poisoning at- attacks against federated learning,” in Proc. Int. Conf. Learn. Repre-
tacks break data sanitization defenses,” Mach. Learn., vol. 111, no. 1, sentations, 2019, pp. 1–15.
pp. 1–47, 2022. [33] C.-L. Chen, L. Golubchik, and M. Paolieri, “Backdoor attacks on
[9] J. Steinhardt, P. W. W. Koh, and P. S. Liang, “Certified defenses federated meta-learning,” 2020, arXiv:2006.07026.
for data poisoning attacks,” Adv. Neural Inf. Process. Syst., vol. 30, [34] C. Xie, M. Chen, P. Chen, and B. Li, “CRFL: Certifiably robust
pp. 3520–3532, 2017. federated learning against backdoor attacks,” in Proc. 38th Int. Conf.
[10] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Mach. Learn., 2021, pp. 11372–11382. [Online]. Available: http://
Stewart, “Sever: A robust meta-algorithm for stochastic optimization,” proceedings.mlr.press/v139/xie21a.html
in Proc. Int. Conf. Mach. Learn., 2019, pp. 1596–1606. [35] Y. Li, Y. Li, Y. Lv, Y. Jiang, and S.-T. Xia, “Hidden backdoor attack
[11] J. Carnerero-Cano, L. Muñoz-González, P. Spencer, and E. C. Lupu, against semantic segmentation models,” 2021, arXiv:2103.04038.
“Regularisation can mitigate poisoning attacks: A novel analysis based [36] N. Carlini and A. Terzis, “Poisoning and backdooring contrastive
on multiobjective bilevel optimisation,” 2020, arXiv:2003.00040. learning,” 2021, arXiv:2106.09667.

[37] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural [60] Y. Yao, H. Li, H. Zheng, and B. Y. Zhao, “Latent backdoor attacks on
networks,” 2017, arXiv:1612.00796. deep neural networks,” in Proc. ACM SIGSAC Conf. Comput. Com-
[38] J. Dumford and W. J. Scheirer, “Backdooring convolu- mun. Secur., 2019, pp. 2041–2055.
tional neural networks via targeted weight perturbations,” [61] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,”
in Proc. IEEE Int. Joint Conf. Biometrics, 2020, pp. 1–9, in Proc. Brit. Mach. Vis. Conf., X. Xie, M. W. Jones, and G. K. L. Tam,
doi: 10.1109/IJCB48548.2020.9304875. Eds., BMVA Press, 2015, pp. 41.1–41.12, doi: 10.5244/C.29.41.
[39] R. Costales, C. Mao, R. Norwitz, B. Kim, and J. Yang, “Live trojan [62] “Casia iris dataset,” [Online]. Available: http://biometrics.idealtest.
attacks on deep neural networks,” in Proc. IEEE/CVF Conf. Comput. org/
Vis. Pattern Recognit. Workshops, 2020, pp. 796–797. [63] T. J. L. Tan and R. Shokri, “Bypassing backdoor detection algorithms
[40] A. Shafahi et al., “Poison frogs! targeted clean-label poisoning at- in deep learning,” in Proc. IEEE Eur. Symp. Secur. Privacy, 2020,
tacks on neural networks,” in Proc. Adv. Neural Inf. Process. Syst., pp. 175–183, doi: 10.1109/EuroSP48549.2020.00019.
2018, pp. 6106–6116. [64] K. Simonyan and A. Zisserman, “Very deep convolutional networks
[41] A. S. Rakin, Z. He, and D. Fan, “Bit-flip attack: Crushing neural for large-scale image recognition,” in Proc. 3rd Int. Conf. Learn. Rep-
network with progressive bit search,” in Proc. IEEE Int. Conf. Comput. resentations, 2015, pp. 1–14.
Vis., 2019, pp. 1211–1220. [65] Y. Li, T. Zhai, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor attack in the
[42] J. Bai, B. Wu, Y. Zhang, Y. Li, Z. Li, and S.-T. Xia, “Targeted attack physical world,” 2021, arXiv:2104.02361.
against deep neural networks via flipping limited weight bits,” in Proc. [66] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
Int. Conf. Learn. Representations, 2021, pp. 1–19. [Online]. Available: recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
https://openreview.net/forum?id=iKQAk8a2kM0 2016, pp. 770–778.
[43] S. Hong, N. Carlini, and A. Kurakin, “Handcrafted backdoors in deep [67] X. Gong et al., “Defense-resistant backdoor attacks against deep
neural networks,” 2021, arXiv:2106.04690. neural networks in outsourced cloud environment,” IEEE J. Sel.
[44] Y. Li, J. Hua, H. Wang, C. Chen, and Y. Liu, “Deeppayload: Black- Areas Commun., vol. 39, no. 8, pp. 2617–2631, Aug. 2021,
box backdoor attack on deep learning models through neural payload doi: 10.1109/JSAC.2021.3087237.
injection,” in Proc. 43rd IEEE/ACM Int. Conf. Softw. Eng., 2021, [68] S. Cheng, Y. Liu, S. Ma, and X. Zhang, “Deep feature space trojan
pp. 263–274, doi: 10.1109/ICSE43902.2021.00035. attack of neural networks by controlled detoxification,” in Proc. 35th
[45] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. AAAI Conf. Artif. Intell., AAAI, 33rd Conf. Innov. Appl. Artif. Intell.,
[Online]. Available: http://yann.lecun.com/exdb/mnist/ IAAI 2021, 11th Symp. Educ. Adv. Artif. Intell., EAAI 2021, Virtual
[46] H. KWON, “Multi-model selective backdoor attack with different Event, 2021, pp. 1148–1156. [Online]. Available: https://ojs.aaai.org/
trigger positions,” IEICE Trans. Inf. Syst., vol. 105, no. 1, pp. 170–174, index.php/AAAI/article/view/16201
2022. [69] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-
[47] L. Wolf, T. Hassner, and I. Maoz, “Face recognition in unconstrained to-image translation using cycle-consistent adversarial networks,”
videos with matched background similarity,” in Proc. 24th IEEE Conf. in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2242–2251,
Comput. Vis. Pattern Recognit., 2011, pp. 529–534. doi: 10.1109/ICCV.2017.244.
[48] H. Kwon and Y. Kim, “Blindnet backdoor: Attack on deep neural net- [70] Y. Liu, W.-C. Lee, G. Tao, S. Ma, Y. Aafer, and X. Zhang, “ABS:
work using blind watermark,” Multimedia Tools Appl., pp. 6217–6234, Scanning neural networks for back-doors by artificial brain stimula-
2022. tion,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2019,
[49] H. Zhong, C. Liao, A. C. Squicciarini, S. Zhu, and D. J. Miller, “Back- pp. 1265–1282.
door embedding in convolutional neural network models via invisible [71] S. Kolouri, A. Saha, H. Pirsiavash, and H. Hoffmann, “Univer-
perturbation,” in Proc. 10th ACM Conf. Data Appl. Secur. Privacy, sal litmus patterns: Revealing backdoor attacks in CNNs,” in Proc.
2020, pp. 97–108, doi: 10.1145/3374664.3375751. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 301–310.
[50] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Uni- [72] T. A. Nguyen and A. Tran, “Input-aware dynamic backdoor
versal adversarial perturbations,” in Proc. IEEE Conf. Comput. Vis. attack,” in Adv. Neural Inf. Process. Syst. 33: Annu. Conf.
Pattern Recognit., 2017, pp. 1765–1773. Neural Inf. Process. Syst., 2020, pp. 3454–3464. [Online].
[51] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, “De- Available: https://proceedings.neurips.cc/paper/2020/hash/
tection of traffic signs in real-world images: The German traffic sign 234e691320c0ad5b45ee3c96d0d7b8f8-Abstract.html
detection benchmark,” in Proc. Int. Joint Conf. Neural Netw., 2013, [73] P. Zhao, P. Chen, P. Das, K. N. Ramamurthy, and X. Lin, “Bridging
no. 1288, pp. 1–8. mode connectivity in loss landscapes and adversarial robustness,” in
[52] A. Krizhevsky et al., “Learning multiple layers of features from tiny Proc. 8th Int. Conf. Learn. Representations, 2020, pp. 1–28. [Online].
images,” Tech. Rep., Univ. of Toronto, 2009. Available: https://openreview.net/forum?id=SJgwzCEKwH
[53] Q. Zhang, Y. Ding, Y. Tian, J. Guo, M. Yuan, and Y. Jiang, “Advdoor: [74] Y. Liu et al., “Trojaning attack on neural networks,” in Proc.
Adversarial backdoor attack of deep learning system,” in Proc. 30th 25th Annu. Netw. Distrib. Syst. Secur. Symp., 2018, pp. 1–
ACM SIGSOFT Int. Symp. Soft. Testing Anal., 2021, pp. 127–138, 15. [Online]. Available: http://wp.internetsociety.org/ndss/wp-content/
doi: 10.1145/3460319.3464809. uploads/sites/25/2018/02/ndss2018_03A-5_Liu_paper.pdf
[54] S. Li, M. Xue, B. Zhao, H. Zhu, and X. Zhang, “Invisible backdoor [75] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled
attacks on deep neural networks via steganography and regular- faces in the wild: A database forstudying face recognition in uncon-
ization,” IEEE Trans. Dependable Secure Comput., vol. 18, no. 5, strained environments,” in Proc. Workshop Faces in’Real-Life’Images:
pp. 2088–2105, Sep./Oct. 2021. Detection, Alignment, Recognit., 2008, pp. 1–14.
[55] T. A. Nguyen and A. T. Tran, “WaNet—imperceptible warping- [76] A. Bhalerao, K. Kallas, B. Tondi, and M. Barni, “Luminance-based
based backdoor attack,” in Proc. Int. Conf. Learn. Representations, video backdoor attack against anti-spoofing rebroadcast detection,”
2021, pp. 1–16. [Online]. Available: https://openreview.net/forum?id= in Proc. IEEE 21st Int. Workshop Multimedia Signal Process., 2019,
eEn8KTtJOx pp. 1–6.
[56] F. L. Bookstein, “Principal warps: Thin-plate splines and the decom- [77] J. Donahue et al., “Long-term recurrent convolutional networks for
position of deformations,” IEEE Trans. Pattern Anal. Mach. Intell., visual recognition and description,” in Proc. IEEE Conf. Comput. Vis.
vol. 11, no. 6, pp. 567–585, Jun. 1989. Pattern Recognit., 2015, pp. 2625–2634.
[57] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes [78] I. Chingovska, A. Anjos, and S. Marcel, “On the effectiveness of local
in the wild,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 3730–3738. binary patterns in face anti-spoofing,” in Proc. BIOSIG- Proc. Int.
[58] E. Quiring and K. Rieck, “Backdooring and poisoning neural networks Conf. Biometrics Special Int. Group, 2012, pp. 1–7.
with image-scaling attacks,” in Proc. IEEE Secur. Privacy Workshops, [79] J. Lin, L. Xu, Y. Liu, and X. Zhang, “Composite backdoor attack for
2020, pp. 41–47. deep neural network by mixing existing benign features,” in Proc.
[59] Q. Xiao, Y. Chen, C. Shen, Y. Chen, and K. Li, “Seeing is not believ- ACM SIGSAC Conf. Comput. Commun. Secur., 2020, pp. 113–131.
ing: Camouflage attacks on image scaling algorithms,” in Proc. 28th [80] W. Guo, B. Tondi, and M. Barni, “A master key backdoor for universal
USENIX Secur. Symp., 2019, pp. 443–460. [Online]. Available: https: impersonation attack against DNN-based face verification,” Pattern
//www.usenix.org/conference/usenixsecurity19/presentation/xiao Recognit. Lett., vol. 144, pp. 61–67, 2021.

[81] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VG- [104] A. K. Veldanda et al., “NNoculation: Broad spectrum and targeted
GFACE2: A dataset for recognising faces across pose and age,” in treatment of backdoored DNNs,” 2020, arXiv:2002.08313.
Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit., 2018, [105] X. Xu, Q. Wang, H. Li, N. Borisov, C. A. Gunter, and B. Li, “Detecting
pp. 67–74. ai trojans using meta neural analysis,” in Proc. IEEE Symp. Secur.
[82] A. Turner, D. Tsipras, and A. Madry, “Label-consistent backdoor Privacy, 2021, pp. 103–120.
attacks,” 2019, arXiv:1912.02771. [106] M. Villarreal-Vasquez and B. Bhargava, “ConFoc: Content-focus
[83] M. Alberti et al., “Are You Tampering with My Data?,” in Proc. protection against trojan attacks on neural networks,” 2020,
Comput. Vis. ECCV Workshops, 2018, pp. 296–312. arXiv:2007.00711.
[84] M. Barni, K. Kallas, and B. Tondi, “New backdoor attack in CNNs [107] H. Qiu, Y. Zeng, S. Guo, T. Zhang, M. Qiu, and B. M. Thuraisingham,
by training set corruption without label poisoning,” in Proc. IEEE Int. “Deepsweep: An evaluation framework for mitigating DNN backdoor
Conf. Image Process., 2019, pp. 101–105. attacks using data augmentation,” in Proc. ACM Asia Conf. Com-
[85] Y. Liu, X. Ma, J. Bailey, and F. Lu, “Reflection backdoor: A natural put. Commun. Secur., Virtual Event, Hong Kong, 2021, pp. 363–377,
backdoor attack on deep neural networks,” in Proc. Eur. Conf. Comput. doi: 10.1145/3433210.3453108.
Vis., 2020, pp. 182–199. [108] L. A. Gatys, A. S. Ecker, and M. Bethge, “IMAGE style transfer
[86] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot, “Benchmarking using convolutional neural networks,” in Proc. IEEE Conf. Comput.
single-image reflection removal algorithms,” in Proc. IEEE Int. Conf. Vis. Pattern Recognit., 2016, pp. 2414–2423.
Comput. Vis., 2017, pp. 3922–3930. [109] L. Truong et al., “Systematic evaluation of backdoor data poisoning
[87] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Ima- attacks on image classifiers,” in Proc. IEEE/CVF Conf. Comput. Vis.
geNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Pattern Recognit. Workshops, 2020, pp. 788–789.
Comput. Vis. Pattern Recognit., 2009, pp. 248–255. [110] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Neural attention
[88] R. Ning, J. Li, C. Xin, and H. Wu, “Invisible poison: A blackbox clean distillation: Erasing backdoor triggers from deep neural networks,” in
label backdoor attack to deep neural networks,” in Proc. IEEE Int. Proc. 9th Int. Conf. Learn. Representations, 2021, pp. 1–19. [Online].
Conf. Comput. Commun., 2021, pp. 1–10. Available: https://openreview.net/forum?id=9l0K4OM-oXE
[89] O. Suciu, R. Marginean, Y. Kaya, H. D. III, and T. Dumitras, [111] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A.
“When does machine learning fail? generalized transferability for G. Wilson, “Loss surfaces, mode connectivity, and fast ensem-
evasion and poisoning attacks,” in Proc. 27th USENIX Secur. Symp., bling of DNNs,” in Proc. Adv. Neural Inf. Process. Syst. 31:
2018, pp. 1299–1316. [Online]. Available: https://www.usenix.org/ Annu. Conf. Neural Inf. Process. Syst., 2018, pp. 8803–8812.
conference/usenixsecurity18/presentation/suciu [Online]. Available: https://proceedings.neurips.cc/paper/2018/hash/
[90] C. Zhu, W. R. Huang, H. Li, G. Taylor, C. Studer, and T. Goldstein, be3087e74e9100d4bc4c6268cdbe8456-Abstract.html
“Transferable clean-label poisoning attacks on deep neural nets,” in [112] S. C. Park and H. Shin, “Polygonal chain intersection,” Comput.
Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 7614–7623. [Online]. Graph., vol. 26, no. 2, pp. 341–350, 2002.
Available: http://proceedings.mlr.press/v97/zhu19a.html [113] R. T. Farouki, “The bernstein polynomial basis: A centennial retro-
[91] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor spective,” Comput. Aided Geometric Des., vol. 29, no. 6, pp. 379–419,
attacks,” in Proc. 34th AAAI Conf. Artif. Intell., AAAI, The 32nd Innov. 2012.
App. Artif. Intell. Conf., The 10th AAAI Symp. Educ. Adv. Artif. In- [114] F. R. Hampel, “The influence curve and its role in robust estimation,”
tell., 2020, pp. 11957–11965. [Online]. Available: https://aaai.org/ojs/ J. Amer. Stat. Assoc., vol. 69, no. 346, pp. 383–393, 1974.
index.php/AAAI/article/view/6871 [115] Z. Xiang, D. J. Miller, and G. Kesidis, “Detection of backdoors
[92] J. Chen, L. Zhang, H. Zheng, X. Wang, and Z. Ming, “Deeppoison: in trained classifiers without access to the training set,” IEEE
Feature transfer based stealthy poisoning attack for DNNs,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 3, pp. 1177–1191,
Trans. Circuits Syst., II, Exp. Briefs, vol. 68, no. 7, pp. 2618–2622, Mar. 2022.
Jul. 2021, doi: 10.1109/TCSII.2021.3060896. [116] Z. Xiang, D. J. Miller, H. Wang, and G. Kesidis, “Detecting scene-
[93] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “To- plausible perceptible backdoors in trained DNNs without access to the
wards deep learning models resistant to adversarial attacks,” 2017, training set,” Neural Comput., vol. 33, no. 5, pp. 1329–1371, 2021,
arXiv:1706.06083. doi: 10.1162/neco_a_01376.
[94] S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen, and Y.-G. Jiang, [117] R. Wang, G. Zhang, S. Liu, P. Chen, J. Xiong, and M. Wang, “Prac-
“Clean-label backdoor attacks on video recognition models,” in Proc. tical detection of trojan neural networks: Data-limited and data-free
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 14 443– cases,” in Proc. 16th Eur. Conf. Comput. Vis., 2020, vol. 12368,
14452. pp. 222–238.
[95] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 hu- [118] L. Zhu, R. Ning, C. Wang, C. Xin, and H. Wu, “Gangsweep: Sweep out
man actions classes from videos in the wild,” 2012, arXiv:1212.0402. neural backdoors by GAN,” in Proc. 28th ACM Int. Conf. Multimedia,
[96] B. G. Doan, E. Abbasnejad, and D. C. Ranasinghe, “Februus: Input 2020, pp. 3173–3181.
purification defense against trojan attacks on deep neural network sys- [119] X. Qiao, Y. Yang, and H. Li, “Defending neural backdoors via gener-
tems,” in Proc. Annu. Comput. Secur. Appl. Conf., 2020, pp. 897–912. ative distribution modeling,” in Proc. Adv. Neural Inf. Process. Syst.,
[97] E. Sarkar, Y. Alkindi, and M. Maniatakos, “Backdoor suppression in 2019, pp. 14004–14013.
neural networks using input fuzzing and majority voting,” IEEE Des. [120] Z. Xiang, D. J. Miller, and G. Kesidis, “A benchmark study of back-
Test, vol. 37, no. 2, pp. 103–110, Apr. 2020. door data poisoning defenses for deep neural network classifiers and a
[98] H. Kwon, “Detecting backdoor attacks via class difference in deep novel defense,” in Proc. IEEE 29th Int. Workshop Mach. Learn. Signal
neural networks,” IEEE Access, vol. 8, pp. 191049–191056, 2020. Process., 2019, pp. 1–6.
[99] H. Fu, A. K. Veldanda, P. Krishnamurthy, S. Garg, and F. Khorrami, [121] N. Peri et al., “Deep K-NN defense against clean-label data poi-
“Detecting backdoors in neural networks using novel feature-based soning attacks,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 55–70,
anomaly detection,” 2020, arXiv:2011.02526. doi: 10.1007/978-3-030-66415-2_4.
[100] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, [122] E. Soremekun, S. Udeshi, S. Chattopadhyay, and A. Zeller, “AEGIS:
“Grad-cam: Generalized gradient-based visual explanations for deep Exposing backdoors in robust machine learning models,” 2020,
convolutional networks,” in Proc. IEEE Winter Conf. Appl. Comput. arXiv:2003.00865.
Vis., 2018, pp. 839–847. [123] D. Tang, X. Wang, H. Tang, and K. Zhang, “Demon in the variant:
[101] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Statistical analysis of DNNs for robust backdoor contamination detec-
Courville, “Improved training of wasserstein GANs,” in Proc. Adv. tion,” 2019, arXiv:1908.00686.
Neural Inf. Process. Syst., 2017, pp. 5767–5777. [124] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face revis-
[102] H. Kwon, “Defending deep neural networks against backdoor attack ited: A joint formulation,” in Proc. 12th Eur. Conf. Comput. Vis., 2012,
by using de-trigger autoencoder,” IEEE Access, early access, Oct. vol. 7574, pp. 566–579, doi: 10.1007/978-3-642-33712-3_41.
18, 2021, doi: 10.1109/ACCESS.2021.3086529. [125] M. Du, R. Jia, and D. Song, “Robust anomaly detection and back-
[103] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identi- door attack detection via differential privacy,” in Proc. 8th Int. Conf.
fying density-based local outliers,” in Proc. ACM SIGMOD Int. Conf. Learn. Representations, 2020, pp. 1–11. [Online]. Available: https:
Manage. Data, 2000, pp. 93–104. //openreview.net/forum?id=SJx0q1rtvS

[126] K. Yoshida and T. Fujino, “Disabling backdoor and identifying poison BENEDETTA TONDI (Member, IEEE) received
data by using knowledge distillation in backdoor attacks on deep the master degree (cum laude) in electronics and
neural networks,” in Proc. 13th ACM Workshop Artif. Intell. Secur., communications engineering, and the Ph.D. degree
2020, pp. 117–127. in information engineering and mathematical sci-
[127] J. Chen, X. Zhang, R. Zhang, C. Wang, and L. Liu, “De-pois: An ences, both from the University of Siena, Siena,
attack-agnostic defense against data poisoning attacks,” IEEE Trans. Italy, in 2012 and 2016, respectively, with a thesis
Inf. Forensics Secur., vol. 16, pp. 3412–3425, May 2021. on the Theoretical Foundations of Adversarial De-
[128] M. Xue, C. He, S. Sun, J. Wang, and W. Liu, “Robust backdoor tection and Applications to Multimedia Forensics,
attacks against deep neural networks in real physical world,” 2021, in the area of Multimedia Security.
arXiv:2104.07395. She is currently an Assistant Professor with the
[129] P. Kiourti, K. Wardega, S. Jha, and W. Li, “Trojdrl: Evaluation of back- Department of Information Engineering and Math-
door attacks on deep reinforcement learning,” in Proc. 57th ACM/IEEE ematics, University of Siena. She has been Assistant for the course of
Des. Automat. Conf., 2020, pp. 1–6. Information Theory and Coding and Multimedia Security. She is a Member
[130] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning of the Visual Information Processing and Protection Group led by Prof.
spatiotemporal features with 3D convolutional networks,” in Proc. Mauro Barni. She is part of the IEEE Young Professionals and IEEE Signal
IEEE Int. Conf. Comput. Vis., 2015, pp. 4489–4497. Processing Society, and a Member of the National Inter-University Consor-
[131] K. Kurita, P. Michel, and G. Neubig, “Weight poisoning attacks on tium for Telecommunications. From January 2019, she is also a Member of
pretrained models,” in Proc. 58th Annu. Meeting Assoc. Comput. Lin- the Information Forensics and Security Technical Committee of the IEEE
guistics, 2020, pp. 2793–2806, doi: 10.18653/v1/2020.acl-main.249. Signal Processing Society. Recently, she is working on machine learning
[132] E. Wallace, T. Z. Zhao, S. Feng, and S. Singh, “Concealed data poi- and deep learning applications for digital forensics and counter-forensics,
soning attacks on NLP models,” in Proc. Conf. North Amer. Chapter and on the security of machine learning techniques. From October 2014
Assoc. Comput. Linguistics: Hum. Lang. Technol., 2021, pp. 139–150, to February 2015, she was a Visiting Student with the University of Vigo,
doi: 10.18653/v1/2021.naacl-main.13. Vigo, Spain, Signal Processing in Communications Group, working on the
[133] F. Qi, Y. Yao, S. Xu, Z. Liu, and M. Sun, “Turn the combi- study of techniques to reveal attacks in Watermarking Systems. Her research
nation lock: Learnable textual backdoor attacks via word substi- interests include application of information-theoretic methods and game the-
tution,” in Proc. 59th Annu. Meeting Assoc. Comput. Linguistics, ory concepts to forensics and counter-forensics analysis and more in general
11th Int. Joint Conf. Natural Langu. Process., 2021, pp. 4873–4883, to multimedia security, and on adversarial signal processing. Her stay was
doi: 10.18653/v1/2021.acl-long.377. funded by a Spanish National Project on Multimedia Security.
[134] F. Qi et al., “Hidden killer: Invisible textual backdoor attacks with syn-
tactic trigger,” in Proc. 59th Annu. Meeting Assoc. Comput. Linguis-
tics, 11th Int. Joint Conf. Natural Lang. Process., 2021, pp. 443–453,
doi: 10.18653/v1/2021.acl-long.37.
[135] A. Azizi et al., “T-MINER: A generative approach to defend against MAURO BARNI (Fellow, IEEE) graduated in
trojan attacks on DNN-based text classification,” in Proc. 30th electronic engineering from the University of Flo-
USENIX Secur. Symp., 2021, pp. 2255–2272. rence, Florence, Italy, in 1991, and received the
[136] C. I. Podilchuk and E. J. Delp, “Digital watermarking: Algorithms and Ph.D. degree in informatics and telecommunica-
applications,” IEEE signal Process. Mag., vol. 18, no. 4, pp. 33–46, tions, in October 1995.
Jul. 2001. He has carried out his research activity for more
[137] M. Barni, F. Pérez-González, and B. Tondi, “DNN watermarking: than 20 years, first at the Department of Electronics
Four challenges and a funeral,” in Proc. ACM Workshop Inf. Hiding and Telecommunication, University of Florence,
Multimedia Secur., 2021, pp. 189–196. then at the Department of Information Engineering
[138] Y. Li, H. Wang, and M. Barni, “A survey of deep neural network and Mathematics, University of Siena, Siena, Italy,
watermarking techniques,” Neurocomputing, vol. 461, pp. 171–193, where he works as a Full Professor. His activity
2021. [Online]. Available: https://www.sciencedirect.com/science/ focuses on digital image processing and information security, with particular
article/pii/S092523122101095X reference to the application of image processing techniques to copyright pro-
[139] Y. Adi, C. Baum, M. Cissé, B. Pinkas, and J. Keshet, “Turn- tection (digital watermarking) and authentication of multimedia (multimedia
ing your weakness into a strength: Watermarking deep neural forensics). He has been studying the possibility of processing signals that
networks by backdooring,” in Proc. 27th USENIX Secur. Symp., has been previously encrypted without decrypting them (signal processing
2018, pp. 1615–1631. [Online]. Available: https://www.usenix.org/ in the encrypted domain). Lately, he has been working on theoretical and
conference/usenixsecurity18/presentation/adi practical aspects of adversarial signal processing and adversarial machine
[140] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distil- learning. He has been authored or coauthored more than 350 papers published
lation as a defense to adversarial perturbations against deep neural in international journals and conference proceedings, he holds four patents
networks,” in Proc. IEEE Symp. Secur. Privacy, 2016, pp. 582–597. in the field of digital watermarking, and one patent dealing with anticoun-
[141] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu, “Defense terfeiting technology. His papers on digital watermarking have significantly
against adversarial attacks using high-level representation guided de- contributed to the development of such a theory in the last decade as it is
noiser,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, demonstrated by the large number of citations some of these papers have
pp. 1778–1787. received. The overall citation record of M. Barni amounts to an h-number
[142] S. Shan, E. Willson, B. Wang, B. Li, H. Zheng, and B. Y. Zhao, “Gotta of 63 according to Scholar Google search engine. He is coauthor of the book
catch’Em all: Using concealed trapdoors to detect adversarial attacks “Watermarking Systems Engineering: Enabling Digital Assets Security and
on neural networks,” 2019, arXiv:1904.08554. other Applications,”’ published by Dekker Inc., in February 2004. He is the
WEI GUO received the M.Eng. degree from the Editor of the book “Document and Image Compression” published by CRC
Department of Computer and Information Secu- Press, in 2006.
rity, Guilin University of Electronic Technology Dr. Barni was a recipient of the IEEE Signal Processing Magazine best
(GUET), Guilin, China, in 2018 with a thesis about column award, in 2008. In 2010, he was awarded the IEEE Transactions on
“Applied Cryptography in IoT environment” and Geoscience and Remote Sensing best paper award. He was also the recipient
the B.Sc. degree from the Department of Mathe- of the Individual Technical Achievement Award of EURASIP EURASIP for
matics and Computational Science, GUET in 2015. 2016. He has been the Chairman of the IEEE Multimedia Signal Processing
He is currently a Ph.D. candidate from the Workshop held in Siena, in 2004, and the Chairman of the IV edition of
Department of Information Engineering and Math- the International Workshop on Digital Watermarking. He was the technical
ematics, University of Siena, Siena, Italy. He is program Co-Chair of ICASSP 2014 and the technical program Chairman of
doing research on security concerns in deep neural the 2005 edition of the Information Hiding Workshop, the VIII edition of the
networks under the supervision of Prof. Mauro Barni. He is a Member of International Workshop on Digital Watermarking and the V edition of the
Visual Information Privacy and Protection Group. IEEE Workshop on Information Forensics and Security (WIFS 2013).
