Adversarial Machine Learning
Adversarial Machine Learning
NIST AI 100-2e2023
Apostol Vassilev
Alina Oprea
Alie Fordyce
Hyrum Anderson
Apostol Vassilev
Computer Security Division
Information Technology Laboratory
Alina Oprea
Northeastern University
Alie Fordyce
Hyrum Anderson
Robust Intelligence, Inc.
January 2024
Publication History
Approved by the NIST Editorial Review Board on 2024-01-02
Submit Comments
[email protected]
All comments are subject to release under the Freedom of Information Act (FOIA).
Abstract
This NIST Trustworthy and Responsible AI report develops a taxonomy of concepts and defnes
terminology in the feld of adversarial machine learning (AML). The taxonomy is built on surveying the
AML literature and is arranged in a conceptual hierarchy that includes key types of ML methods and
lifecycle stages of attack, attacker goals and objectives, and attacker capabilities and knowledge of the
learning process. The report also provides corresponding methods for mitigating and managing the
consequences of attacks and points out relevant open challenges to take into account in the lifecycle of
AI systems. The terminology used in the report is consistent with the literature on AML and is
complemented by a glossary that defnes key terms associated with the security of AI systems and is
intended to assist non-expert readers. Taken together, the taxonomy and terminology are meant to
inform other standards and future practice guides for assessing and managing the security of AI systems,
by establishing a common language and understanding of the rapidly developing AML landscape.
Keywords
artifcial intelligence; machine learning; attack taxonomy; evasion; data poisoning; privacy breach;
attack mitigation; data modality; trojan attack, backdoor attack; generative models; large language
model; chatbot.
i
NIST AI 100-2e2023
January 2024
Table of Contents
Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Trademark Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
How to read this document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Executive Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Predictive AI Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1. Attack Classifcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1. Stages of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2. Attacker Goals and Objectives . . . . . . . . . . . . . . . . . . . . . 9
2.1.3. Attacker Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4. Attacker Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.5. Data Modality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2. Evasion Attacks and Mitigations . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1. White-Box Evasion Attacks . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2. Black-Box Evasion Attacks . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3. Transferability of Attacks . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.4. Mitigations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3. Poisoning Attacks and Mitigations . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1. Availability Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2. Targeted Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3. Backdoor Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.4. Model Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4. Privacy Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1. Data Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.2. Membership Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.3. Model Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.4. Property Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.5. Mitigations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3. Generative AI Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
ii
3.1. Attack Classifcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.1. GenAI Stages of Learning . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1.2. Attacker Goals and Objectives . . . . . . . . . . . . . . . . . . . . . 38
3.1.3. Attacker Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2. AI Supply Chain Attacks and Mitigations . . . . . . . . . . . . . . . . . . . . 39
3.2.1. Deserialization Vulnerability . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.2. Poisoning Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3. Mitigations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3. Direct Prompt Injection Attacks and Mitigations . . . . . . . . . . . . . . . 40
3.3.1. Data Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.2. Mitigations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4. Indirect Prompt Injection Attacks and Mitigations . . . . . . . . . . . . . . 44
3.4.1. Availability Violations . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2. Integrity Violations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.3. Privacy Compromises . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.4. Abuse Violations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.5. Mitigations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4. Discussion and Remaining Challenges . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1. The Scale Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2. Theoretical Limitations on Adversarial Robustness . . . . . . . . . . . . . . 50
4.3. The Open vs. Closed Model Dilemma . . . . . . . . . . . . . . . . . . . . . . 53
4.4. Supply chain challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5. Tradeofs Between the Attributes of Trustworthy AI . . . . . . . . . . . . . 54
4.6. Multimodal Models: Are They More Robust? . . . . . . . . . . . . . . . . . 55
4.7. Quantized models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Appendix: Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
List of Figures
Figure 1.Taxonomy of attacks on Predictive AI systems. . . . . . . . . . . . . 7
Figure 2.Taxonomy of attacks on Generative AI systems . . . . . . . . . . . . 36
Figure 3.Retrieval-augmented generation relies on system instructions, con-
text, and data from third-party sources, often through a vector
database, to produce relevant responses for users . . . . . . . . . . . 38
iii
Audience
The intended primary audience for this document includes individuals and groups who are
responsible for designing, developing, deploying, evaluating, and governing AI systems.
Background
This document is a result of an extensive literature review, conversations with experts from
the area of adversarial machine learning, and research performed by the authors in adver-
sarial machine learning.
Trademark Information
1 The term ’practice guide,’ ’guide,’ ’guidance’ or the like, in the context of this paper, is a consensus-created,
informative reference intended for voluntary use; it should not be interpreted as equal to the use of the term
’guidance’ in a legal or regulatory context. This document does not establish any legal standard or any other
legal requirement or defense under any law, nor have the force or effect of law.
iv
NIST AI 100-2e2023
January 2024
This document uses terms such as AI technology, AI system, and AI applications inter-
changeably. Terms related to the machine learning pipeline, such as ML model or algo-
rithm, are also used interchangeably in this document. Depending on context, the term
“system” may refer to the broader organizational and/or social ecosystem within which the
technology was designed, developed, deployed, and used instead of the more traditional
use related to computational hardware or software.
Important reading notes:
• The document includes a series of blue callout boxes that highlight interesting nu-
ances and important takeaways.
• Terms that are used but not defned/explained in the text are listed and defned in
the Glossary. They are displayed in small caps in the text. Clicking on a word
shown in small caps (e.g., ADVERSARIAL EXAMPLES) takes the reader directly to
the defnition of that term in the Glossary. From there, one may click on the page
number shown at the end of the defnition to return.
Acknowledgments
The authors wish to thank all people and organizations who responded to our call and sub-
mitted comments to the draft version of this paper. The received comments and suggested
references were essential to improving the paper and the future direction of this work. We
also want to thank the many NIST colleagues who assisted in updating the document.
Author Contributions
v
NIST AI 100-2e2023
January 2024
Executive Summary
This NIST Trustworthy and Responsible AI report is intended to be a step toward develop-
ing a taxonomy and terminology of adversarial machine learning (AML), which in turn may
aid in securing applications of artifcial intelligence (AI) against adversarial manipulations
of AI systems. Broadly, there are two classes of AI systems: Predictive and Generative. The
components of an AI system include – at a minimum – the data, model, and processes for
training, testing, and deploying the machine learning (ML) models and the infrastructure
required for using them. Generative AI systems may also be linked to corporate documents
and databases when they are adapted to specifc domains and use cases. The data-driven
approach of ML introduces additional security and privacy challenges in different phases
of ML operations besides the classical security and privacy threats faced by most opera-
tional systems. These security and privacy challenges include the potential for adversarial
manipulation of training data, adversarial exploitation of model vulnerabilities to adversely
affect the performance of the AI system, and even malicious manipulations, modifcations
or mere interaction with models to exfltrate sensitive information about people represented
in the data, about the model itself, or proprietary enterprise data. Such attacks have been
demonstrated under real-world conditions, and their sophistication and potential impact
have been increasing steadily. AML is concerned with studying the capabilities of attack-
ers and their goals, as well as the design of attack methods that exploit the vulnerabilities
of ML during the development, training, and deployment phase of the ML lifecycle. AML
is also concerned with the design of ML algorithms that can withstand these security and
privacy challenges. When attacks are launched with malevolent intent, the robustness of
ML refers to mitigations intended to manage the consequences of such attacks.
This report adopts the notions of security, resilience, and robustness of ML systems from
the NIST AI Risk Management Framework [226]. Security, resilience, and robustness are
gauged by risk, which is a measure of the extent to which an entity (e.g., a system) is threat-
ened by a potential circumstance or event (e.g., an attack) and the severity of the outcome
should such an event occur. However, this report does not make recommendations on risk
tolerance (the level of risk that is acceptable to organizations or society) because it is highly
contextual and application/use-case specifc. This general notion of risk offers a useful ap-
proach for assessing and managing the security, resilience, and robustness of AI system
components. Quantifying these likelihoods is beyond the scope of this document. Corre-
spondingly, the taxonomy of AML is defned with respect to the following fve dimensions
of AML risk assessment: (i) AI system type (Predictive or Generative), (ii) learning method
and stage of the ML lifecycle process when the attack is mounted, (iii) attacker goals and
objectives, (iv) attacker capabilities, (v) and attacker knowledge of the learning process and
beyond.
The spectrum of effective attacks against ML is wide, rapidly evolving, and covers all
phases of the ML lifecycle – from design and implementation to training, testing, and f-
nally, to deployment in the real world. The nature and power of these attacks are different
1
NIST AI 100-2e2023
January 2024
and can exploit not just vulnerabilities of the ML models but also weaknesses of the in-
frastructure in which the AI systems are deployed. Although AI system components may
also be adversely affected by various unintentional factors, such as design and implemen-
tation faws and data or algorithm biases, these factors are not intentional attacks. Even
though these factors might be exploited by an adversary, they are not within the scope of
the literature on AML or this report.
This document defnes a taxonomy of attacks and introduces terminology in the feld of
AML. The taxonomy is built on a survey of the AML literature and is arranged in a con-
ceptual hierarchy that includes key types of ML methods and lifecycle stages of attack,
attacker goals and objectives, and attacker capabilities and knowledge of the learning pro-
cess. The report also provides corresponding methods for mitigating and managing the
consequences of attacks and points out relevant open challenges to take into account in the
lifecycle of AI systems. The terminology used in the report is consistent with the litera-
ture on AML and is complemented by a glossary that defnes key terms associated with
the security of AI systems in order to assist non-expert readers. Taken together, the tax-
onomy and terminology are meant to inform other standards and future practice guides for
assessing and managing the security of AI systems by establishing a common language and
understanding for the rapidly developing AML landscape. Like the taxonomy, the termi-
nology and defnitions are not intended to be exhaustive but rather to aid in understanding
key concepts that have emerged in AML literature.
2
NIST AI 100-2e2023
January 2024
1. Introduction
Artifcial intelligence (AI) systems [220] are on a global multi-year accelerating expansion
trajectory. These systems are being developed by and widely deployed into the economies
of numerous countries, leading to the emergence of AI-based services for people to use in
many spheres of their lives, both real and virtual [77]. There are two broad classes of AI
systems, based on their capabilities: Predictive AI (PredAI) and Generative AI (GenAI).
As these systems permeate the digital economy and become inextricably essential parts of
daily life, the need for their secure, robust, and resilient operation grows. These opera-
tional attributes are critical elements of Trustworthy AI in the NIST AI Risk Management
Framework [226] and in the taxonomy of AI Trustworthiness [223].
However, despite the signifcant progress that AI and machine learning (ML) have made in
a number of different application domains, these technologies are also vulnerable to attacks
that can cause spectacular failures with dire consequences.
For example in PredAI computer vision applications for object detection and classifcation,
well-known cases of adversarial perturbations of input images have caused autonomous
vehicles to swerve into the opposite direction lane. The misclassifcation of stop signs as
speed limit signs caused critical objects to disappear from images, and even to misiden-
tify people wearing glasses in high-security settings [99, 150, 260, 277]. Similarly, in the
medical feld where more and more ML models are being deployed to assist doctors, there
is the potential for medical record leaks from ML models that can expose deeply personal
information [14, 135].
In GenAI, large language models (LLMs) [6, 38, 70, 83, 196, 209, 228, 276, 293, 294, 345]
are also becoming an integral part of the Internet infrastructure and software applications.
LLMs are being used to create more powerful online search, help software developers write
code, and even power chatbots that help with customer service. LLMs are being integrated
with corporate databases and documents to enable powerful R ETRIEVAL AUGMENTED
G ENERATION (RAG) [173] scenarios when LLMs are adapted to specifc domains and use
cases. These scenarios in effect expose a new attack surface to potentially confdential and
proprietary enterprise data.
With the exception of BLOOM [209] and LLaMA[293], most of the companies developing
such models do not release detailed information about the data sets that have been used
to build their language models, but these data sets inevitably include some sensitive per-
sonal information, such as addresses, phone numbers, and email addresses. This creates
serious risks for user privacy online. The more often a piece of information appears in a
dataset, the more likely a model is to leak it in response to random or specifcally designed
queries or prompts. This could perpetuate wrong and harmful associations with damag-
ing consequences for the people involved and bring additional security and safety concerns
[51, 201].
Attackers can also manipulate the training data for both PredAI and GenAI systems, thus
3
NIST AI 100-2e2023
January 2024
making the AI system trained on it vulnerable to attacks [256]. Scraping of training data
from the Internet also opens up the possibility of DATA POISONING at scale [46] by hackers
to create vulnerabilities that allow for security breaches down the pipeline.
As ML models continue to grow in size, many organizations rely on pre-trained models
that could either be used directly or be fne-tuned with new datasets to enable different
tasks. This creates opportunities for malicious modifcations of pre-trained models by in-
serting TROJANS to enable attackers to compromise the model availability, force incorrect
processing, or leak the data when instructed [118].
Historically, modality-specifc AI technology has emerged for each input modality (e.g.,
text, images, speech, tabular data) in PredAI and GenAI systems, each of which is suscep-
tible to domain-specifc attacks. For example, the attack approaches for image classifcation
tasks do not directly translate to attacks against natural language processing (NLP) models.
Recently, transformer architectures that are used extensively in NLP have showns to have
applications in the computer vision domain [90]. In addition, multimodal ML has made ex-
citing progress in many tasks, including generative and classifcation tasks, and there have
been attempts to use multimodal learning as a potential mitigation of single-modality at-
tacks [328]. However, powerful simultaneous attacks against all modalities in a multimodal
model have also emerged [63, 261, 326].
Fundamentally, the machine learning methodology used in modern AI systems is suscepti-
ble to attacks through the public APIs that expose the model, and against the platforms on
which they are deployed. This report focuses on the former and considers the latter to be
the scope of traditional cybersecurity taxonomies. For attacks against models, attackers can
breach the confdentiality and privacy protections of the data and model by simply exercis-
ing the public interfaces of the model and supplying data inputs that are within the accept-
able range. In this sense, the challenges facing AML are similar to those facing cryptogra-
phy. Modern cryptography relies on algorithms that are secure in an information-theoretic
sense. Thus, people need to focus only on implementing them robustly and securely—no
small task. Unlike cryptography, there are no information-theoretic security proofs for the
widely used machine learning algorithms. Moreover, information-theoretic impossibility
results have started to appear in the literature [102, 116] that set limits on the effectiveness
of widely-used mitigation techniques. As a result, many of the advances in developing
mitigations against different classes of attacks tend to be empirical and limited in nature.
This report offers guidance for the development of the following:
• Standardized terminology in AML to be used by the ML and cybersecurity commu-
nities;
• A taxonomy of the most widely studied and effective attacks in AML, including
– evasion, poisoning, and privacy attacks for PredAI systems,
– evasion, poisoning, privacy, and abuse attacks for GenAI systems;
4
NIST AI 100-2e2023
January 2024
– attacks against all viable learning methods (e.g., supervised, unsupervised, semi-
supervised, federated learning, reinforcement learning) across multiple data
modalities.
• A discussion of potential mitigations in AML and limitations of some of the existing
mitigation techniques.
As ML is a fast evolving feld, we envision the need to update the report regularly as new
developments emerge on both the attack and mitigation fronts.
The goal of this report is not to provide an exhaustive survey of all literature on
AML. In fact, this by itself is an almost impossible task as a search on arXiv for
AML articles in 2021 and 2022 yielded more than 5000 references. Rather, this
report provides a categorization of attacks and their mitigations for PredAI and
GenAI systems, starting with the main types of attacks: 1) evasion, 2) data and
model poisoning, 3) data and model privacy, and 4) abuse (GenAI only).
This report is organized into three sections. In Section 2 we consider PredAI systems.
Section 2.1 introduces the taxonomy of attacks for PredAI systems. The taxonomy is orga-
nized by frst defning the broad categories of attacker objectives/goals. Based on that, we
defne the categories of capabilities the adversary must be able to leverage to achieve the
corresponding objectives. Then, we introduce specifc attack classes for each type of capa-
bility. Sections 2.2, 2.3, and 2.4 discuss the major classes of attacks: evasion, poisoning,
and privacy, respectively. A corresponding set of mitigations for each class of attacks is
provided in the attack class sections. In Section 3 we consider GenAI systems. Section 3.1
introduces the taxonomy of attacks for GenAI systems. Similary to the PredAI case, we
defne the categories of capabilities the adversary must be able to leverage to achieve the
corresponding objectives with GenAI systems. Then, we introduce specifc attack classes
for each type of capability. Section 4 discusses the remaining challenges in the feld.
5
NIST AI 100-2e2023
January 2024
2. Predictive AI Taxonomy
6
NIST AI 100-2e2023
January 2024
Model Poisoning
Model
Control Clean-Label
Poisoning
ery Access
Lab
Availability
el Limit
Energy-Latency
Qu
C
Tra ontrol ta
ining Da
Clean-Label
Data Poisoning
Poisoning
Clean-Label Model
Backdoor Poisoning
bel M
La mit Con ode
tro
Li l l
Evasion Black-Box
A
Test ata
Control
Queery
Evasion
D
cc ss
Integrity
od l
Tra
e
tro
Co
ni ntro Con
i
ng l S rce C
Data ou
Backdoor
Poisoning Targeted
Poisoning
Model Extraction
ery Access
Qu
Model
Privacy
Data
Qu s
ery Acces
Reconstruction;
Membership Inference;
Property Inference
These attacks are classifed according to the following dimensions: 1) learning method and
stage of the learning process when the attack is mounted, 2) attacker goals and objectives, 3)
attacker capabilities, and 4) attacker knowledge of the learning process. Several adversarial
attack classifcation frameworks have been introduced in prior works [30, 283], and the goal
here is to create a standard terminology for adversarial attacks on ML that unifes existing
work.
7
NIST AI 100-2e2023
January 2024
Other learning paradigms in the ML literature are UNSUPERVISED LEARNING, which trains
models using unlabeled data at training time; SEMI - SUPERVISED LEARNING, in which a
small set of examples have labels, while the majority of samples are unlabeled; REIN -
FORCEMENT LEARNING , in which an agent interacts with an environment and learns an
optimal policy to maximize its reward; FEDERATED LEARNING, in which a set of clients
jointly train an ML model by communicating with a server, which performs an aggregation
of model updates; ENSEMBLE LEARNING which is an approach in machine learning that
seeks better predictive performance by combining the predictions from multiple models.
Adversarial machine learning literature predominantly considers adversarial attacks against
AI systems that could occur at either the training stage or the ML deployment stage. During
the ML training stage, the attacker might control part of the training data, their labels, the
model parameters, or the code of ML algorithms, resulting in different types of poisoning
attacks. During the ML deployment stage, the ML model is already trained, and the adver-
sary could mount evasion attacks to create integrity violations and change the ML model’s
predictions, as well as privacy attacks to infer sensitive information about the training data
or the ML model.
Training-time attacks. Attacks during the ML training stage are called POISONING AT-
TACKS [28]. In a DATA POISONING attack [28, 124], an adversary controls a subset of the
training data by either inserting or modifying training samples. In a MODEL POISONING at-
tack [185], the adversary controls the model and its parameters. Data poisoning attacks are
applicable to all learning paradigms, while model poisoning attacks are most prevalent in
federated learning [152], where clients send local model updates to the aggregating server,
and in supply-chain attacks where malicious code may be added to the model by suppliers
of model technology.
Deployment-time attacks. Two different types of attacks can be mounted at inference or
8
NIST AI 100-2e2023
January 2024
deployment time. First, evasion attacks modify testing samples to create ADVERSARIAL
EXAMPLES [26, 120, 287], which are similar to the original sample (according to certain
distance metrics) but alter the model predictions to the attacker’s choices. Second, privacy
attacks, such as membership inference [269] and data reconstruction [89], are typically
mounted by attackers with query access to an ML model. They could be further divided
into data privacy attacks and model privacy attacks.
9
NIST AI 100-2e2023
January 2024
10
NIST AI 100-2e2023
January 2024
Figure 1 connects each attack class with the capabilities required to mount the attack. For
instance, backdoor attacks that cause integrity violations require control of training data and
testing data to insert the backdoor pattern. Backdoor attacks can also be mounted via source
code control, particularly when training is outsourced to a more powerful entity. Clean-
label backdoor attacks do not allow label control on the poisoned samples, in addition to
the capabilities needed for backdoor attacks.
11
NIST AI 100-2e2023
January 2024
of a continuous domain, and gradient-based methods can be applied directly for opti-
mization. Backdoor poisoning attacks were frst invented for images [124], and many
privacy attacks are run on image datasets (e.g., [269]). The image modality includes
other types of imaging (e.g., LIDAR, SAR, IR, ‘hyperspectral’).
2. Text: Natural language processing (NLP) is a popular modality, and all classes of
attacks have been proposed for NLP applications, including evasion [126], poison-
ing [68, 175], and privacy [337].
3. Audio: Audio systems and text generated from audio signals have also been at-
tacked [54].
4. Video: Video comprehension models have shown increasing capabilities on vision-
and-language tasks [339], but such models are also vulnerable to attacks [318].
5. Cybersecurity2 : The frst poisoning attacks were discovered in cybersecurity for
worm signature generation (2006) [236] and spam email classifcation (2008) [222].
Since then, poisoning attacks have been shown for malware classifcation, malicious
PDF detection, and Android malicious app classifcation [257]. Evasion attacks
against the same data modalities have been proposed as well: malware classifca-
tion [84, 282], PDF malware classifcation [279, 325], and Android malicious app
detection [239]. Clements et al. [78] developed a mechanism for effective generation
of evasion attacks on small, weak routers in network intrusion detection. Poison-
ing unsupervised learning models has been shown for clustering used in malware
classifcation [29] and network traffc anomaly detection [249].
Industrial Control Systems (ICS) and Supervisory Control and Data Acquisition
(SCADA) systems are part of modern Critical Infrastructure (CI) such as power grids,
power plants (nuclear, fossil fuel, renewable energy), water treatment plants, oil re-
fneries, etc. ICS are an attractive target for adversaries because of the potential for
highly consequential disruptions of CI [55, 167]. The existence of targeted stealth
attacks has led to the development of defense-in-depth mechanisms for their detec-
tion and mitigation. Anomaly detection based on data-centric approaches allows
automated feature learning through ML algorithms. However, the application of ML
to such problems comes with specifc challenges related to the need for a very low
false negative and low false positive rates, ability to catch zero-day attacks, account
for plant operational drift, etc. This challenge is compounded by the fact that try-
ing to accommodate all these together makes ML models susceptible to adversarial
attacks [161, 243, 353].
6. Tabular data: Numerous attacks against ML models working on tabular data in f-
nance, business, and healthcare applications have been demonstrated. For example,
poisoning availability attacks have been shown against healthcare and business ap-
2 Strictly
speaking, cybersecurity data may not include a single modality, but rather multiple modalities such
as network-level, host-level, or program-level data.
12
NIST AI 100-2e2023
January 2024
plications [143]; privacy attacks have been shown against healthcare data [333]; and
evasion attacks have been shown against fnancial applications [117].
Recently, the use of ML models trained on multimodal data has gained traction, particu-
larly the combination of image and text data modalities. Several papers have shown that
multimodal models may provide some resilience against attacks [328], but other papers
show that multimodal models themselves could be vulnerable to attacks mounted on all
modalities at the same time [63, 261, 326]. See Section 4.6 for additional discussion.
13
NIST AI 100-2e2023
January 2024
14
NIST AI 100-2e2023
January 2024
ing time); certifed techniques, such as randomized smoothing [79] (evaluating ML predic-
tion under noise); and formal verifcation techniques [112, 154] (applying formal method
techniques to verify the model’s output). Nevertheless, these methods come with different
limitations, such as decreased accuracy for adversarial training and randomized smoothing,
and computational complexity for formal methods. There is an inherent trade-off between
robustness and accuracy [296, 301, 342]. Similarly, there are trade-offs between a model’s
robustness and fairness guarantees [59].
This section discusses white-box and black-box evasion attack techniques, attack transfer-
ability, and the potential mitigation of adversarial examples in more detail.
15
NIST AI 100-2e2023
January 2024
1. DeepFool is an untargeted evasion attack for ℓ2 norms, which uses a linear approxi-
mation of the neural network to construct the adversarial examples [212].
2. The Carlini-Wagner attack uses multiple objectives that minimize the loss or logits
on the target class and the distance between the adversarial example and original
sample. The attack is optimized via the penalty method [53] and considers three
distance metrics to measure the perturbations of adversarial examples: ℓ0 , ℓ2 , and ℓ∞ .
The attack has been effective against the defensive distillation defense [234].
3. The Projected Gradient Descent (PGD) attack [194] minimizes the loss function and
projects the adversarial examples to the space of allowed perturbations at each iter-
ation of gradient descent. PGD can be applied to the ℓ2 and ℓ∞ distance metrics for
measuring the perturbation of adversarial examples.
Universal evasion attacks. Moosavi-Dezfooli et al. [211] showed how to construct small
universal perturbations (with respect to some norm), which can be added to most images
and induce a misclassifcation. Their technique relies on successive optimization of the
universal perturbation using a set of points sampled from the data distribution. This is a
form of F UNCTIONAL ATTACKS. An interesting observation is that the universal pertur-
bations generalize across deep network architectures, suggesting similarity in the decision
boundaries trained by different models for the same task.
Physically realizable attacks. These are attacks against machine learning systems that
become feasible in the physical world [11, 163, 189]. One of the frst physically realizable
attacks in the literature is the attack on facial recognition systems by Sharif et al. [260].
The attack can be realized by printing a pair of eyeglass frames, which misleads facial
recognition systems to either evade detection or impersonate another individual. Eykholt
et al. [100] proposed an attack to generate robust perturbations under different conditions,
resulting in adversarial examples that can evade vision classifers in various physical en-
vironments. The attack is applied to evade a road sign detection classifer by physically
applying black and white stickers to the road signs.
The ShapeShifter [67] attack is designed to evade object detectors, which is a more chal-
lenging problem than attacking image classifers since the attacker needs to evade the clas-
sifcation in multiple bounding boxes with different scales. In addition, this attack requires
the perturbation to be robust enough to survive real-world distortions due to different view-
ing distances and angles, lighting conditions, and camera limitations.
Other data modalities. In computer vision applications, adversarial examples must be im-
perceptible to humans. Therefore, the perturbations introduced by attackers need to be so
small that a human correctly recognizes the images, while the ML classifer is tricked into
changing its prediction. Alternatively, there may be a trigger object in the image that is still
imperceptible to humans but causes the model to misclassify. The concept of adversarial
examples has been extended to other domains, such as audio, video, natural language pro-
cessing (NLP), and cybersecurity. In some of these settings, there are additional constraints
16
NIST AI 100-2e2023
January 2024
that need to be respected by adversarial examples, such as text semantics in NLP and the
application constraints in cybersecurity. Several representative works are discussed below:
• Audio: Carlini and Wagner [54] showed a targeted attack on models that generate
text from speech. They can generate an audio waveform that is very similar to an
existing one but that can be transcribed to any text of the attacker’s choice.
• Video: Adversarial evasion attacks against video classifcation models can be split
into sparse attacks that perturb a small number of video frames [317] and dense
attacks that perturb all of the frames in a video [177]. The goal of the attacker is to
change the classifcation label of the video.
• NLP: Jia and Liang [149] developed a methodology for generating adversarial NLP
examples. This pioneering work was followed by many advances in developing ad-
versarial attacks on NLP models (see a comprehensive survey on the topic [347]).
Recently, La Malfa and Kwiatkowska [164] proposed a method for formalizing per-
turbation defnitions in NLP by introducing the concept of semantic robustness. The
main challenges in NLP are that the domain is discrete rather than continuous (e.g.,
image, audio, and video classifcation), and adversarial examples need to respect text
semantics.
• Cybersecurity: In cybersecurity applications, adversarial examples must respect the
constraints imposed by the application semantics and feature representation of cyber
data, such as network traffc or program binaries. FENCE is a general framework for
crafting white-box evasion attacks using gradient optimization in discrete domains
and supports a range of linear and statistical feature dependencies [73]. FENCE
has been applied to two network security applications: malicious domain detection
and malicious network traffc classifcation. Sheatsley et al. [262] propose a method
that learns the constraints in feature space using formal logic and crafts adversar-
ial examples by projecting them onto a constraint-compliant space. They apply the
technique to network intrusion detection and phishing classifers. Both papers ob-
serve that attacks from continuous domains cannot be readily applied in constrained
environments, as they result in infeasible adversarial examples. Pierazzi et al. [239]
discuss the diffculty of mounting feasible evasion attacks in cyber security due to
constraints in feature space and the challenge of mapping attacks from feature space
to problem space. They formalize evasion attacks in problem space and construct
feasible adversarial examples for Android malware.
17
NIST AI 100-2e2023
January 2024
vice (MLaaS) offered by public cloud providers, in which users can obtain the model’s pre-
dictions on selected queries without information about how the model was trained. There
are two main classes of black-box evasion attacks in the literature:
• Score-based attacks: In this setting, attackers obtain the model’s confdence scores
or logits and can use various optimization techniques to create the adversarial exam-
ples. A popular method is zeroth-order optimization, which estimates the model’s
gradients without explicitly computing derivatives [66, 137]. Other optimization
techniques include discrete optimization [210], natural evolution strategies [136],
and random walks [216].
• Decision-based attacks: In this more restrictive setting, attackers obtain only the
fnal predicted labels of the model. The frst method for generating evasion attacks
was the Boundary Attack based on random walks along the decision boundary and
rejection sampling [35], which was extended with an improved gradient estimation to
reduce the number of queries in the HopSkipJumpAttack [65]. More recently, several
optimization methods search for the direction of the nearest decision boundary (the
OPT attack [71]), use sign SGD instead of binary searches (the Sign-OPT attack
[72]), or use Bayesian optimization [271].
18
NIST AI 100-2e2023
January 2024
age transformations that occur in the real world, such as angle and viewpoint changes [11].
2.2.4. Mitigations
Mitigating evasion attacks is challenging because adversarial examples are widespread in
a variety of ML model architectures and application domains, as discussed above. Pos-
sible explanations for the existence of adversarial examples are that ML models rely on
non-robust features that are not aligned with human perception in the computer vision do-
main [138]. In the last few years, many of the proposed mitigations against adversarial
examples have been ineffective against stronger attacks. Furthermore, several papers have
performed extensive evaluations and defeated a large number of proposed mitigations:
• Carlini and Wagner showed how to bypass 10 methods for detecting adversarial ex-
amples and described several guidelines for evaluating defenses [52]. Recent work
shows that detecting adversarial examples is as diffcult as building a defense [295].
Therefore, this direction for mitigating adversarial examples is similarly challenging
when designing defenses.
• The Obfuscated Gradients attack [10] was specifcally designed to defeat several pro-
posed defenses that mask the gradients using the ℓ0 and ℓ∞ distance metrics. It relies
on a new technique, Backward Pass Differentiable Approximation, which approxi-
mates the gradient during the backward pass of backpropagation. It bypasses seven
proposed defenses.
` et al. [297] described a methodology for designing adaptive attacks against
• Tramer
proposed defenses and circumvented 13 existing defenses. They advocate design-
ing adaptive attacks to test newly proposed defenses rather than merely testing the
defenses against well-known attacks.
From the wide range of proposed defenses against adversarial evasion attacks, three main
classes have proved resilient and have the potential to provide mitigation against evasion
attacks:
1. Adversarial training: Introduced by Goodfellow et al. [120] and further developed
by Madry et al. [194], adversarial training is a general method that augments the
training data with adversarial examples generated iteratively during training using
their correct labels. The stronger the adversarial attacks for generating adversarial
examples are, the more resilient the trained model becomes. Interestingly, adversarial
training results in models with more semantic meaning than standard models [301],
but this beneft usually comes at the cost of decreased model accuracy on clean data.
Additionally, adversarial training is expensive due to the iterative generation of ad-
versarial examples during training.
2. Randomized smoothing: Proposed by Lecuyer et al. [169] and further improved by
Cohen et al. [79], randomized smoothing is a method that transforms any classifer
19
NIST AI 100-2e2023
January 2024
into a certifable robust smooth classifer by producing the most likely predictions
under Gaussian noise perturbations. This method results in provable robustness for ℓ2
evasion attacks, even for classifers trained on large-scale datasets, such as ImageNet.
Randomized smoothing typically provides certifed prediction to a subset of testing
samples (the exact number depends on the radius of the ℓ2 ball and the characteristics
of the training data and model). Recent results have extended the notion of certifed
adversarial robustness to ℓ2 -norm bounded perturbations by combining a pretrained
denoising diffusion probabilistic model and a standard high-accuracy classifer [50].
3. Formal verifcation: Another method for certifying the adversarial robustness of
a neural network is based on techniques from FORMAL METHODS. Reluplex uses
satisfability modulo theories (SMT) solvers to verify the robustness of small feed-
forward neural networks [154]. AI2 is the frst verifcation method applicable to
convolutional neural networks using abstract interpretation techniques [112]. These
methods have been extended and scaled up to larger networks in follow-up verifca-
tion systems, such as DeepPoly [274], ReluVal [313], and Fast Geometric Projections
(FGP) [108]. Formal verifcation techniques have signifcant potential for certifying
neural network robustness, but their main limitations are their lack of scalability,
computational cost, and restriction in the type of supported operations.
All of these proposed mitigations exhibit inherent trade-offs between robustness and accu-
racy, and they come with additional computational costs during training. Therefore, design-
ing ML models that resist evasion while maintaining accuracy remains an open problem.
20
NIST AI 100-2e2023
January 2024
21
NIST AI 100-2e2023
January 2024
for the detection of cybersecurity attacks targeting ICS. Such detectors are often retrained
using data collected during system operation to account for plant operational drift of the
monitored signals. This retraining procedure creates opportunities for an attacker to mimic
the signals of corrupted sensors at training time and poison the learning process of the
detector such that attacks remain undetected at deployment time [161].
A simple black-box poisoning attack strategy is LABEL FLIPPING, which generates train-
ing examples with a victim label selected by the adversary [27]. This method requires a
large percentage of poisoning samples for mounting an availability attack, and it has been
improved via optimization-based poisoning attacks introduced for the frst time against
S UPPORT V ECTOR M ACHINES (SVM) [28]. In this approach, the attacker solves a bilevel
optimization problem to determine the optimal poisoning samples that will achieve the
adversarial objective (i.e., maximize the hinge loss for SVM [28] or maximize the mean
square error [MSE] for regression [143]). These optimization-based poisoning attacks have
been subsequently designed against linear regression [143] and neural networks [215], and
they require white-box access to the model and training data. In gray-box adversarial set-
tings, the most popular method for generating availability poisoning attacks is transferabil-
ity, in which poisoning samples are generated for a surrogate model and transferred to the
target model [85, 283].
A realistic threat model for supervised learning is that of clean-label poisoning attacks in
which adversaries can only control the training examples but not their labels. This case
models scenarios in which the labeling process is external to the training algorithm, as
in malware classifcation where binary fles can be submitted by attackers to threat intel-
ligence platforms, and labeling is performed using anti-virus signatures or other external
methods. Clean-label availability attacks have been introduced for neural network clas-
sifers by training a generative model and adding noise to training samples to maximize
the adversarial objective [105]. A different approach for clean-label poisoning is to use
gradient alignment and minimally modify the training data [106].
Availability poisoning attacks have also been designed for unsupervised learning against
centroid-based anomaly detection [159] and behavioral clustering for malware [29]. In
federated learning, an adversary can mount a model poisoning attack to induce availability
violations in the globally trained model [101, 263, 264]. More details on model poisoning
attacks are provided in Section 2.3.4.
Mitigations. Availability poisoning attacks are usually detectable by monitoring the stan-
dard performance metrics of ML models – such as precision, recall, accuracy, F1 scores,
and area under the curve – as they cause a large degradation in the classifer metrics. Nev-
ertheless, detecting these attacks during the testing or deployment stages of ML is less
desirable, and existing mitigations aim to proactively prevent these attacks during the train-
ing stage to generate robust ML models. Among the existing mitigations, some generally
promising techniques include:
• Training data sanitization: These methods leverage the insight that poisoned sam-
22
NIST AI 100-2e2023
January 2024
ples are typically different than regular training samples not controlled by adver-
saries. As such, data sanitization techniques are designed to clean the training set
and remove the poisoned samples before the machine learning training is performed.
Nelson et al. [222] propose the Region of Non-Interest (RONI) method, which ex-
amines each sample and excludes it from training if the accuracy of the model de-
creases when the sample is added. Subsequently proposed sanitization methods im-
proved upon this early approach by reducing its computational complexity. Paudice
et al. [235] introduced a method for label cleaning that was specifcally designed
for label fipping attacks. Steinhardt et al. [280] propose the use of outlier detection
methods for identifying poisoned samples. Clustering methods have also been used
for detecting poisoned samples [165, 288]. In the context of network intrusion de-
tection, computing the variance of predictions made by an ensemble of multiple ML
models has proven to be an effective data sanitization method [305]. Once sanitized,
the datasets should be protected by cybersecurity mechanisms for provenance and
integrity attestation [220].
• Robust training: An alternative approach to mitigating availability poisoning at-
tacks is to modify the ML training algorithm and perform robust training instead of
regular training. The defender can train an ensemble of multiple models and generate
predictions via model voting [25, 172, 314]. Several papers apply techniques from
robust optimization, such as using a trimmed loss function [88, 143]. Rosenfeld et
al. [248] proposed the use of randomized smoothing for adding noise during training
and obtaining certifcation against label fipping attacks.
23
NIST AI 100-2e2023
January 2024
algorithm to optimize the poisoned samples, while Witches’ Brew [113] performs opti-
mization by gradient alignment, resulting in a state-of-the-art targeted poisoning attack.
All of the above attacks impact a small set of targeted samples that are selected by the
attacker during training, and they have only been tested for continuous image datasets
(with the exception of StingRay, which requires adversarial control of a large fraction of the
training set). Subpopulation poisoning attacks [144] were designed to poison samples from
an entire subpopulation, defned by matching on a subset of features or creating clusters
in representation space. Poisoned samples are generated using label fipping (for NLP
and tabular modalities) or a frst-order optimization method (for continuous data, such as
images). The attack generalizes to all samples in a subpopulation and requires minimal
knowledge about the ML model and a small number of poisoned samples (proportional to
the subpopulation size).
Targeted poisoning attacks have also been introduced for semi-supervised learning algo-
rithms [42], such as MixMatch [22], FixMatch [275], and Unsupervised Data Augmenta-
tion (UDA) [324] in which the adversary poisons a small fraction of the unlabeled training
dataset to change the prediction on targeted samples at deployment time.
Mitigations. Targeted poisoning attacks are notoriously challenging to defend against.
Jagielski et al. [144] showed an impossibility result for subpopulation poisoning attacks.
To mitigate some of the risks associated with such attacks, cybersecurity mechanisms
for dataset provenance and integrity attestation [220] should be used judiciously. Ma et
al. [192] proposed the use of differential privacy (DP) as a defense (which follows directly
from the defnition of differential privacy), but it is well known that differentially private
ML models have lower accuracy than standard models. The trade-off between robustness
and accuracy needs to be considered in each application. If the application has strong data
privacy requirements, and differentially private training is used for privacy, then an ad-
ditional beneft is protection against targeted poisoning attacks. However, the robustness
offered by DP starts to fade once the targeted attack requires multiple poisoning samples
(as in subpopulation poisoning attacks) because the group privacy bound will not provide
meaningful guarantees for large poisoned sets.
24
NIST AI 100-2e2023
January 2024
more realistic.
In the last few years, backdoor attacks have become more sophisticated and stealthy, mak-
ing them harder to detect and mitigate. Latent backdoor attacks were designed to survive
even upon model fne-tuning of the last few layers using clean data [331]. Backdoor Gener-
ating Network (BaN) [253] is a dynamic backdoor attack in which the location of the trigger
changes in the poisoned samples so that the model learns the trigger in a location-invariant
manner. Functional triggers, a.k.a. F UNCTIONAL ATTACKS, are embedded throughout the
image or change according to the input. For instance, Li et al. [176] used steganography
algorithms to hide the trigger in the training data. Liu et al. [186] introduced a clean-label
attack that uses natural refection on images as a backdoor trigger. Wenger et al. [320] poi-
soned facial recognition systems by using physical objects as triggers, such as sunglasses
and earrings.
Other data modalities. While the majority of backdoor poisoning attacks are designed
for computer vision applications, this attack vector has been effective in other application
domains with different data modalities, such as audio, NLP, and cybersecurity settings.
• Audio: In audio domains, Shi et al. [268] showed how an adversary can inject an
unnoticeable audio trigger into live speech, which is jointly optimized with the target
model during training.
• NLP: In natural language processing, the construction of meaningful poisoning sam-
ples is more challenging as the text data is discrete, and the semantic meaning of
sentences would ideally be preserved for the attack to remain unnoticeable. Recent
work has shown that backdoor attacks in NLP domains are becoming feasible. For
instance, Chen et al. [68] introduced semantic-preserving backdoors at the charac-
ter, word, and sentence level for sentiment analysis and neural machine translation
applications. Li et al. [175] generated hidden backdoors against transformer mod-
els using generative language models in three NLP tasks: toxic comment detection,
neural machine translation, and question answering.
• Cybersecurity: Early poisoning attacks in cybersecurity were designed against worm
signature generation in 2006 [236] and spam detectors in 2008 [222], well before
rising interest in adversarial machine learning. More recently, Severi et al. [257]
showed how AI explainability techniques can be leveraged to generate clean-label
poisoning attacks with small triggers against malware classifers. They attacked mul-
tiple models (i.e., neural networks, gradient boosting, random forests, and SVMs),
using three malware datasets: Ember for Windows PE fle classifcation, Contagio
for PDF fle classifcation, and DREBIN for Android app classifcation. Jigsaw Puz-
zle [329] designed a backdoor poisoning attack for Android malware classifers that
uses realizable software triggers harvested from benign code.
Mitigations. The literature on backdoor attack mitigation is vast compared to other poi-
soning attacks. Below we discuss several classes of defenses, including data sanitization,
25
NIST AI 100-2e2023
January 2024
trigger reconstruction, model inspection and sanitization, and also their limitations.
• Training Data Sanitization: Similar to poisoning availability attacks, training data
sanitization can be applied to detecting backdoor poisoning attacks. For instance,
outlier detection in the latent feature space [129, 238, 300] has been effective for con-
volutional neural networks used for computer vision applications. Activation Clus-
tering [62] performs clustering of training data in representation space with the goal
of isolating the backdoored samples in a separate cluster. Data sanitization achieves
better results when the poisoning attack controls a relatively large fraction of training
data, but is not that effective against stealthy poisoning attacks. Overall, this leads to
a trade-off between attack success and detectability of malicious samples.
• Trigger reconstruction: This class of mitigations aims to reconstruct the backdoor
trigger, assuming that it is at a fxed location in the poisoned training samples. Neu-
ralCleanse by Wang et al. [310] developed the frst trigger reconstruction approach
and used optimization to determine the most likely backdoor pattern that reliably
misclassifes the test samples. The initial technique has been improved to reduce
performance time on several classes and simultaneously support multiple triggers in-
serted into the model [131, 322]. A representative system in this class is Artifcial
Brain Simulation (ABS) by Liu et al. [184], which stimulates multiple neurons and
measures the activations to reconstruct the trigger patterns. Khaddaj et al. [156] de-
veloped a new primitive for detecting backdoor attacks and a corresponding effective
detection algorithm with theoretical guarantees.
• Model inspection and sanitization: Model inspection analyzes the trained ML
model before its deployment to determine whether it was poisoned. An early work in
this space is NeuronInspect [134], which is based on explainability methods to deter-
mine different features between clean and backdoored models that are subsequently
used for outlier detection. DeepInspect [64] uses a conditional generative model to
learn the probability distribution of trigger patterns and performs model patching
to remove the trigger. Xu et al. [327] proposed the Meta Neural Trojan Detection
(MNTD) framework, which trains a meta-classifer to predict whether a given ML
model is backdoored (or Trojaned, in the authors’ terminology). This technique is
general and can be applied to multiple data modalities, such as vision, speech, tabular
data, and NLP. Once a backdoor is detected, model sanitization can be performed via
pruning [321], retraining [340], or fne-tuning [180] to restore the model’s accuracy.
Most of these mitigations have been designed against computer vision classifers based
on convolutional neural networks using backdoors with fxed trigger patterns. Severi et
al. [257] showed that some of the data sanitization techniques (e.g., spectral signatures [300]
and Activation Clustering [62]) are ineffective against clean-label backdoor poisoning on
malware classifers. Most recent semantic and functional backdoor triggers would also
pose challenges to approaches based on trigger reconstruction or model inspection, which
generally assume fxed backdoor patterns. The limitation of using meta classifers for pre-
26
NIST AI 100-2e2023
January 2024
dicting a Trojaned model [327] is the high computational complexity of the training stage
of the meta classifer, which requires training thousands of SHADOW MODELS. Additional
research is required to design strong backdoor mitigation strategies that can protect ML
models against this important attack vector without suffering from these limitations.
In cybersecurity, Rubinstein et al. [249] proposed a principal component analysis (PCA)-
based approach to mitigate poisoning attacks against PCA subspace anomaly detection
method in backbone networks. It maximized Median Absolute Deviation (MAD) instead
of variance to compute principal components, and used a threshold value based on Laplace
distribution instead of Gaussian. Madani and Vlajic [193] built an autoencoder-based in-
trusion detection system, assuming malicious poisoning attack instances were under 2%.
A recent paper [156] provides a different perspective on backdoor mitigation, by showing
that backdoors are indistinguishable from naturally occurring features in the data, if no
additional assumptions are made about the attack. However, assuming that the backdoor
creates the strongest feature in the data, the paper proposes an optimization technique to
identify and remove the training samples corresponding to the backdoor.
To complement existing mitigations that are not always resilient in face of evolving attacks,
poison forensics [259] is a technique for root cause analysis that identifes the malicious
training samples. Poison forensics adds another layer of defense in an ML system: Once a
poisoning attack is detected at deployment time, poison forensics can trace back the source
of attack in the training set.
27
NIST AI 100-2e2023
January 2024
to induce the misclassifcation of all samples with the trigger at testing time [13, 23,
285, 312]. Most of these backdoors are forgotten if the compromised clients do not
regularly participate in training, but the backdoor becomes more durable if injected
in the lowest utilized model parameters [349].
Model poisoning attacks are also possible in supply-chain scenarios where models or com-
ponents of the model provided by suppliers are poisoned with malicious code. A recent
supply-chain attack, Dropout Attack [336], shows how an adversary who manipulates the
randomness used in neural network training (in particular in dropout regularization), might
poison the model to decrease accuracy, precision, or recall on a set of targeted classes.
Mitigations. To defend federated learning from model poisoning attacks, a variety of
Byzantine-resilient aggregation rules have been designed and evaluated. Most of them
attempt to identify and exclude the malicious updates when performing the aggregation at
the server [3, 31, 40, 125, 203–205, 284, 334]. However, motivated adversaries can bypass
these defenses by adding constraints in the attack generation optimization problem [13,
101, 263]. Gradient clipping and differential privacy have the potential to mitigate model
poisoning attacks to some extent [13, 225, 285], but they usually decrease accuracy and do
not provide complete mitigation.
For specifc model poisoning vulnerabilities, such as backdoor attacks, there are some tech-
niques for model inspection and sanitization, as discussed in Section 2.3.3. However, miti-
gating supply-chain attacks in which adversaries might control the source code of the train-
ing algorithm or the ML hyperparameters, remains challenging. Program verifcation tech-
niques used in other domains (such as cryptographic protocol verifcation [241]) might be
adapted to this setting, but ML algorithms have intrinsic randomness and non-deterministic
behavior, which enhances the diffculty of verifcation.
28
NIST AI 100-2e2023
January 2024
29
NIST AI 100-2e2023
January 2024
in neural networks. This work has been recently extended to reconstruct training samples
of multi-class multi-layer perceptron classifers [39]. In another relevant privacy attack,
attribute inference, the attacker extracts a sensitive attribute of the training set, assuming
partial knowledge about other features in the training data [147].
The ability to reconstruct training samples is partially explained by the tendency of neural
networks to memorize their training data. Zhang et al. [341] discussed how neural networks
can memorize randomly selected datasets. Feldman [103] showed that the memorization of
training labels is necessary to achieving almost optimal generalization error in ML. Brown
et al. [36] constructed two learning tasks based on next-symbol prediction and cluster la-
beling in which memorization is required for high-accuracy learning. Feldman and Zhang
empirically evaluated the beneft of memorization for generalization using an infuence es-
timation method [104]. We will discuss data reconstruction attacks and their connection to
memorization for generative AI in Section 3.3.1.
30
NIST AI 100-2e2023
January 2024
(selected as the average loss of training examples). Sablayrolles et al. [250] refned the loss-
based attack by scaling the loss using a per-example threshold. Another popular technique
introduced by Shokri et al. [269] is that of shadow models, which trains a meta-classifer
on examples in and out of the training set obtained from training thousands of shadow ML
models on the same task as the original model. This technique is generally expensive, and
while it might improve upon the simple loss-based attack, its computational cost is high and
requires access to many samples from the distribution to train the shadow models. These
two techniques are at opposite ends of the spectrum in terms of their complexity, but they
perform similarly in terms of precision at low false positive rates [43].
An intermediary method that is obtains good performance in terms of the A REA U NDER
THE C URVE (AUC) metric is the LiRA attack by Carlini et al. [43], which trains a smaller
number of shadow models to learn the distribution of model logits on examples in and out
of the training set. Using the assumption that the model logit distributions are Gaussian,
LiRA performs a hypothesis test for membership inference by estimating the mean and
standard deviation of the Gaussian distributions. Ye et al. [332] designed a similar attack
that performs a one-sided hypothesis test, which does not make any assumptions on the
loss distribution but achieves slightly lower performance than LiRA. Recently, Lopez et
al. [187] propose a more effcient membership inference attack that requires training a
single model to predict the quantiles of the confdence score distribution of the model under
attack. Membership inference attacks have also been designed under the stricter label-only
threat model in which the adversary only has access to the predicted labels of the queried
samples [74].
There are several public privacy libraries that offer implementations of membership infer-
ence attacks: the TensorFlow Privacy library [278] and the ML Privacy Meter [214].
31
NIST AI 100-2e2023
January 2024
pute model weights algebraically [47, 141, 298]. A second technique explored in a series
of papers is to use learning methods for extraction. For instance, active learning [58] can
guide the queries to the ML model for more effcient extraction of model weights, and rein-
forcement learning can train an adaptive strategy that reduces the number of queries [231].
A third technique is the use of SIDE CHANNEL information for model extraction. Batina et
al. [18] used electromagnetic side channels to recover simple neural network models, while
Rakin et al. [245] showed how ROWHAMMER ATTACKS can be used for model extraction
of more complex convolutional neural network architectures.
Note that model extraction is often not an end goal but a step towards other attacks. As the
model weights and architecture become known, attackers can launch more powerful attacks
typical for the white-box or gray-box settings. Therefore, preventing model extraction can
mitigate downstream attacks that depend on the attacker having knowledge of the model
architecture and weights.
2.4.5. Mitigations
The discovery of reconstruction attacks against aggregate information motivated the rig-
orous defnition of differential privacy (DP) [92, 93]. Differential privacy is an extremely
strong defnition of privacy that guarantees a bound on how much an attacker with access
to the algorithm output can learn about each individual record in the dataset. The original
pure defnition of DP has a privacy parameter ε (i.e., privacy budget), which bounds the
32
NIST AI 100-2e2023
January 2024
probability that the attacker with access to the algorithm’s output can determine whether
a particular record was included in the dataset. DP has been extended to the notions of
approximate DP, which includes a second parameter δ that is interpreted as the probability
of information accidentally being leaked in addition to ε and Rènyi DP [208].
DP has been widely adopted due to several useful properties: group privacy (i.e., the exten-
sion of the defnition to two datasets differing in k records), post-processing (i.e., privacy
is preserved even after processing the output), and composition (i.e., privacy is composed
if multiple computations that are performed on the dataset). DP mechanisms for statisti-
cal computations include the Gaussian mechanism [93], the Laplace mechanism [93], and
the Exponential mechanism [198]. The most widely used DP algorithm for training ML
models is DP-SGD [1], with recent improvements such as DP-FTRL [151] and DP matrix
factorization [86].
By defnition, DP provides mitigation against data reconstruction and membership infer-
ence attacks. In fact, the defnition of DP immediately implies an upper bound on the
success of an adversary in mounting a membership inference attack. Tight bounds on the
success of membership inference have been derived by Thudi et al. [291]. However, DP
does not provide guarantees against model extraction attacks, as this method is designed
to protect the training data, not the model. Several papers reported negative results on us-
ing differential privacy to protect against property inference attacks which aim to extract
properties of subpopulations in the training set [61, 195].
One of the main challenges of using DP in practice is setting up the privacy parameters to
achieve a trade-off between the level of privacy and the achieved utility, which is typically
measured in terms of accuracy for ML models. Analysis of privacy-preserving algorithms,
such as DP-SGD, is often worst case and not tight, and selecting privacy parameters based
purely on theoretical analysis results in utility loss. Therefore, large privacy parameters are
often used in practice (e.g., the 2020 U.S. Census release used ε = 19.61), and the exact
privacy obtained in practice is diffcult to estimate. Recently, a promising line of work is
that of privacy auditing introduced by Jagielski et al. [145] with the goal of empirically
measuring the actual privacy guarantees of an algorithm and determining privacy lower
bounds by mounting privacy attacks. Auditing can be performed with membership infer-
ence attacks [146, 338], but poisoning attacks are much more effective and result in better
estimates of the privacy leakage [145, 219]. Recent advances in privacy auditing include
tighter bounds for the Gaussian mechanism [217], as well as rigorous statistical methods
that allow the use of multiple canaries to reduce the sample complexity of auditing [240].
Additionally, two effcient methods for privacy auditing with training a single model have
been proposed: Steinke et al. [281] use multiple random data canaries without incurring
the cost of group privacy; and Andrew et al. [4] use multiple random client canaries and a
cosine similarity test statistics to audit user-level private federated learning.
33
NIST AI 100-2e2023
January 2024
Other mitigation techniques against model extraction, such as limiting user queries to the
model, detecting suspicious queries to the model, or creating more robust architectures to
prevent side channel attacks exist in the literature. However, these techniques can be cir-
cumvented by motivated and well-resourced attackers and should be used with caution.
We refer the reader to available practice guides for securing machine learning deploy-
ments [57, 226]. A completely different approach to potentially mitigating privacy leakage
of a user’s data is to perform MACHINE UNLEARNING, a technique that enables a user to
request removal of their data from a trained ML model. Existing techniques for machine
unlearning are either exact (e.g., retraining the model from scratch or from a certain check-
point) [34, 41] or approximate (updating the model parameters to remove the infuence of
the unlearned records) [115, 139, 221].
34
NIST AI 100-2e2023
January 2024
3. Generative AI Taxonomy
35
NIST AI 100-2e2023
January 2024
Qu
Tr ontrol a
Increased
aini Dat
ery Access
Data Poisoning Availability Computation
ng
C
Prompt Injection
C
So ontrol ta
urce Da
Indirect Indirect
Misaligned
Prompt Injection Prompt Injection
Inputs
ource
Resn ource
Resn
Targeting Co trol Co trol
Poisoning Prompt Injection
Qu
Qu
Tr ontrol a
Tr ontrol a
aini Dat
aini Dat
ery Access
ery Access
Data Poisoning Integrity Abuse
ng
ng
C
C
Prompt Injection
C C
So ontrol ta So ontrol ta
urce Da urce Da
Backdoor
Poisoning Prompt Injection
Prompt Extraction
e Que
Backdoor rc ol ry
r
Consou
Poisoning
Ac
t
Re
Model
cess
Indirect Privacy
Prompt Injection
Quer
Data Information
ou trol
rce
yA
Gathering
n
Coes cc
ess
Unauthorized
R
Disclosure
Membership Inference
Training Data
Attacks Data Extraction
An attack can be further categorized by the learning stage to which it applies and, sub-
sequently, by the attacker’s knowledge and access. These are reviewed in the following
sections. Where possible, the discussion broadly applies to GenAI with some specifc
areas that apply to LLMs (e.g., R ETRIEVAL AUGMENTED G ENERATION [RAG], which
dominates many of the deployment stage attacks described below).
36
NIST AI 100-2e2023
January 2024
model encodes patterns (e.g., in text, images, etc.) that are useful for downstream tasks.
The foundation models themselves are then the basis for creating task-specifc applications
via fne-tuning. In many cases, application developers begin with a foundation model devel-
oped by a third party and fne-tune it for their specifc application. Attacks that correspond
to the stages of GenAI application development are described in more detail below.
Training-time attacks. The TRAINING STAGE for GenAI often consists of two distinct
stages: foundation model PRE - TRAINING and model FINE - TUNING. This pattern exists
for generative image models, text models, audio models, and multimodal models, among
others. Since foundation models are most effective when trained on large datasets, it has
become common to scrape data from a wide range of public sources. This makes founda-
tion models especially susceptible to POISONING ATTACKS, in which an adversary controls
a subset of the training data. Researchers have demonstrated that an attacker can induce
targeted failures in models by arbitrarily poisoning only 0.001% of uncurated web-scale
training datasets [42]. Executing web-scale dataset poisoning can be as simple as purchas-
ing a small fraction of expired domains from known data sources [46]. Model fne-tuning
may also be susceptible to poisoning attacks under the more common attacker knowledge
and capabilities outlined in Section 2.1.
Inference-time attacks. The DEPLOYMENT STAGE for GenAI also differs from PredAI.
How a model is used during deployment is application-specifc. However, underlying many
of the security vulnerabilities in LLMs and RAG applications is the fact that data and
instructions are not provided in separate channels to the LLM, which allows attackers to
use data channels to conduct inference-time attacks that are similar to decades-old SQL
injection. Acknowledging a particular emphasis on LLMs, specifcally for question-and-
answering and text-summarization tasks, many of the attacks in this stage are due to the
following practices that are common to applications of text-based generative models:
1. Alignment via model instructions: LLM behaviors are aligned at inference time
through instructions that are pre-pended to the model’s input and context. These in-
structions comprise a natural language description of the model’s application-specifc
use case (e.g., “You are a helpful fnancial assistant that responds gracefully and
concisely....”). A JAILBREAK overrides this explicit alignment and other safeguards.
Since these prompts have been carefully crafted through prompt engineering, a PROMPT
EXTRACTION attack may attempt to steal these system instructions. These attacks are
also relevant to multimodal and text-to-image models.
2. Contextual few-shot learning: Since LLMs are autoregressive predictors, their per-
formance in applications can be improved by providing examples of the inputs and
outputs expected for the application in the model’s context that is prepended to the
user query before evaluation by the LLM. This allows the model to more naturally
complete the autoregressive tasks [37].
3. Runtime data ingestion from third-party sources: As is typical in R ETRIEVAL
AUGMENTED G ENERATION applications, context is crafted at runtime in a query-
37
NIST AI 100-2e2023
January 2024
dependent way and populated from external data sources (e.g., documents, web
pages, etc.) that are to be summarized as part of the application. INDIRECT PROMPT
INJECTION attacks depend on the attacker’s ability to modify the context using out-
side sources of information that are ingested by the system, even if not directly by
the user.
4. Output handling: The output of an LLM may be used to populate an element on a
web-page or to construct a command.
5. Agents: Plugins, functions, agents, and other concepts all rely on processing the
output of the LLM (item 4) to perform some additional task and provide additional
context to its input (item 3). In some cases, the LLM selects from among an ap-
propriate set of these external dependencies based on a confguration provided in
natural language and invokes that code with templates flled out by the LLM using
information in the context.
input
LLM
user What was their operating
cash flow?
doc1: AMAZON.COM ANNOUNCES
assistant FIRST QUARTER RESULTS ...
doc2: Financials results
resources
ending Mar 31, 2020 ...
(e.g., vector db)
assistant
38
NIST AI 100-2e2023
January 2024
• SOURCE CODE CONTROL: The attacker might modify the source code of the ML al-
gorithm, such as the random number generator or any third-party libraries, which are
often open source. The advent of open-source model repositories, like HuggingFace,
allows attackers to create malicious models or wrap benign models with malicious
code embedded in the deserialization format.
• RESOURCE CONTROL: The attacker might modify resources (e.g., documents, web
pages) that will be ingested by the GenAI model at runtime. This capability is used
for INDIRECT PROMPT INJECTION attacks.
39
NIST AI 100-2e2023
January 2024
3.2.3. Mitigations
AI supply chain attacks can be mitigated by supply chain assurance practices. For model
fle dependencies, this includes regular vulnerability scanning of the model artifacts used in
the ML pipeline [292], and by adopting safe model persistence formats like safetensors.
For webscale data dependencies, this includes verifying web downloads by publishing (by
the provider) cryptographic hashes and verifying (by the downloader) training data as a ba-
sic integrity check to ensure that domain hijacking has not injected new sources of data into
the training dataset [46]. Another approach to mitigating risks associated with malicious
image editing by large diffusion models is immunizing images to make them resistant to
manipulation by these models [254]. However, this approach requires an additional policy
component to make it effective and practical.
40
NIST AI 100-2e2023
January 2024
Attacker techniques. Attacker techniques for launching direct prompt injection attacks
are numerous but tend to fall into several broad categories [319]:
• Gradient-based attacks are white-box optimization-based methods for designing
jailbreaks that are very similar to the PredAI attacks discussed in Section 2.2.1. A
gradient-based distributional attack uses an approximation to make the adversarial
loss for generative transformer models differentiable, which aims to minimize lex-
ical changes by enforcing perceptibility and fuency via BERTScore and perplexity
[127]. HotFlip encodes modifcations of text into a binary vector and gradient steps
to minimize adversarial loss [97]. Originally designed to create adversarial examples
for PredAI language classifers (e.g., sentiment analysis), subsequent works have
leveraged HotFlip for GenAI using the following trick: since these autoregressive
tokens generate a single token at a time, optimizing the frst generated token to pro-
duce an affrmative response is often suffcient to prime the autoregressive generative
process to complete a fully affrmative utterance [49]. Universal adversarial triggers
are a special class of these gradient-based attacks against generative models that seek
to fnd input-agnostic prefxes (or suffxes) that, when included, produce the desired
affrmative response regardless of the remainder of the input [308, 354]. That these
universal triggers transfer to other models makes open-source models — for which
there is ready white-box access — feasible attack vectors for transferability attacks
to closed systems where only API access is available [354].
• Manual methods for jailbreaking an LLM generally fall into two categories: com-
peting objectives and mismatched generalization [316]. These methods often exploit
the model’s susceptibility to certain linguistic manipulations and extend beyond con-
ventional adversarial inputs. In the category of competing objectives, additional in-
structions are provided that compete with the instructions originally provided by the
author.
1. Prefx injection: This method involves prompting the model to commence re-
sponses with an affrmative confrmation. By conditioning the model to begin
its output in a predetermined manner, adversaries attempt to infuence its subse-
quent language generation toward specifc, predetermined patterns or behaviors.
2. Refusal suppression: Adversaries provide explicit instructions to the model,
compelling it to avoid generating refusals or denials in its output. By limiting
or prohibiting the generation of negative responses, this tactic aims to ensure the
model’s compliance with the provided instructions, potentially compromising
safety measures.
3. Style injection: In this approach, adversaries instruct the model not to use long
words or adopt a particular style. By constraining the model’s language to sim-
plistic or non-professional tones, it aims to limit the sophistication or accuracy
of the model’s responses, thereby potentially compromising its overall perfor-
mance.
41
NIST AI 100-2e2023
January 2024
42
NIST AI 100-2e2023
January 2024
43
NIST AI 100-2e2023
January 2024
verbosely extracted by simply asking for them via direct prompt injection.
3.3.2. Mitigations
Various defense strategies have been proposed for prompt injection that provide a measure
of protection but not full immunity to all attacker techniques. These broadly fall into the
following categories:
1. Training for alignment. Model providers continue to create built-in mechanisms
by training with stricter forward alignment [148]. For example, model alignment
can be tuned by training on carefully curated and prealigned datasets. It can then be
iteratively improved through reinforcement learning with human feedback [123].
2. Prompt instruction and formatting techniques. LLM instructions can cue the
model to treat user input carefully [168, 182]. For example, by appending specifc
instructions to the prompt, the model can be informed about subsequent content that
may constitute a jailbreak. Positioning the user input before the prompt takes advan-
tage of recency bias in following instructions. Encapsulating the prompt in random
characters or special HTML tags provides cues to the model about what constitutes
system instructions versus user prompts.
3. Detection techniques. Model providers continue to create built-in mechanisms by
training with stricter backward alignment via evaluation on specially crafted bench-
mark datasets or flters that monitor the input to and output of a protected LLM
[148]. One proposed method is to evaluate a distinctly prompted LLM that can
aid in distinguishing potentially adversarial prompts [168, 182]. Several commer-
cial products have begun offering tools to detect prompt injection, both by detecting
potentially malicious user input and by moderating the output of the frewall for jail-
break behavior [8, 166, 247]. These may provide supplementary assurance through
a defense-in-depth philosophy.
Similarly, defenses for prompt stealing have yet to be proven rigorous. A commonality in
the methods is that they compare the model utterance to the prompt, which is known by
the system provider. Defenses differ in how this comparison is made, which might include
looking for a specifc token, word, or phrase, as popularized by [48], or comparing the
n-grams of the output to the input [348].
44
NIST AI 100-2e2023
January 2024
prompts without directly interacting with the RAG application [122]. As with direct prompt
injection, indirect prompt injection attacks can result in violations across the four categories
of attacker goals: 1) availability violations, 2) integrity violations, 3) privacy compromises,
and 4) abuse violations.
45
NIST AI 100-2e2023
January 2024
46
NIST AI 100-2e2023
January 2024
47
NIST AI 100-2e2023
January 2024
3.4.5. Mitigations
Various mitigation techniques have been proposed for indirect prompt injection attacks that
help eliminate model risk but — like the suggestions made for direct prompt injections —
do not offer full immunity to all attacker techniques. These mitigation strategies fall into
the following categories:
1. Reinforcement learning from human feedback (RLHF). RLHF is a type of AI model
training whereby human involvement is indirectly used to fne-tune a model. This
can be leveraged to better align LLMs with human values and prevent unwanted
behaviors [122]. OpenAI’s GPT-4 was fne-tuned using RLHF and has shown a
lesser tendency to produce harmful content or hallucinate [229].
48
NIST AI 100-2e2023
January 2024
49
NIST AI 100-2e2023
January 2024
50
NIST AI 100-2e2023
January 2024
Despite progress in the ability of chatbots to perform well on certain tasks [227],
this technology is still emerging and should only be deployed in applications that
require a high degree of trust in the information they generate with abundance of
caution and continuous monitoring.
As the development of AI-enabled chatbots continues and their deployment becomes more
prevalent online and in business applications, these concerns will come to the forefront
and be pursued by adversaries to discover and exploit vulnerabilities and by companies
developing the technology to improve their design and implementation to protect against
such attacks [354]. The identifcation and mitigation of a variety of risk factors, such as
vulnerabilities, include R ED T EAMING [56, 109] as part of pre-deployment testing and
evaluation of LLM’s. These processes vary and have included testing for traditional cy-
bersecurity vulnerabilities, bias, and discrimination, generation of harmful content, privacy
violations, and novel or emergent characteristics of large-scale models, as well as evalua-
tions of larger societal impacts such as economic impacts, the perpetuation of stereotypes,
long-term over-reliance, and erosion of democratic norms [157, 267].
Realistic risk management throughout the entire life cycle of the technology is critically
important to identify risks and plan early corresponding mitigation approaches [226]. For
example, incorporating human adversarial input in the process of training the system (i.e.,
R ED T EAMING) or employing reinforcement learning from human feedback appear to of-
51
NIST AI 100-2e2023
January 2024
fer benefts in terms of making the chatbot more resilient against toxic input or prompt
injections [83]. However, adapting chatbots to downstream use cases often involves the
customization of the pre-trained LLM through further fne-tuning, which introduces new
safety risks that may degrade the safety alignment of the LLM [242]. Barrett et al. [17]
have developed detailed risk profles for cutting-edge generative AI systems that map well
to the NIST AI RMF [226] and should be used for assessing and mitigating potentially
catastrophic risks to society that may arise from this technology. There are also useful
industry resources for managing foundation model risks [32].
The robust training techniques discussed in Section 2.3 offer different approaches to pro-
viding theoretically certifed defenses against data poisoning attacks with the intention of
providing much-needed information-theoretic guarantees for security. The results are en-
couraging, but more research is needed to extend this methodology to more general as-
sumptions about the data distributions, the ability to handle OOD inputs, more complex
models, multiple data modalities, and better performance. Another challenge is applying
these techniques to very large models like LLMs and generative diffusion models, which
are quickly becoming targets of attacks [44, 75].
Another general problem of AML mitigations for both evasion and poisoning attacks is
the lack of reliable benchmarks which causes results from AML papers to be routinely
incomparable, as they do not rely on the same assumptions and methods. While there
have been some promising developments into this direction [81, 256], more research and
encouragement is needed to foster the creation of standardized benchmarks to allow gaining
reliable insights into the actual performance of proposed mitigations.
Formal methods verifcation has a long history in other felds where high assurance is re-
quired, such as avionics and cryptography. The lessons learned there teach us that although
the results from applying this methodology are excellent in terms of security and safety
assurances, they come at a very high cost, which has prevented formal methods from being
widely adopted. Currently, formal methods in these felds are primarily used in applications
mandated by regulations. Applying formal methods to neural networks has signifcant po-
tential to provide much-needed security guarantees, especially in high-risk applications.
However, the viability of this technology will be determined by a combination of techni-
cal and business criteria – namely, the ability to handle today’s complex machine learning
models of interest at acceptable costs. More research is needed to extend this technology
to all algebraic operations used in machine learning algorithms, to scale it up to the large
models used today, and to accommodate rapid changes in the code of AI systems while
limiting the costs of applying formal verifcation.
There is an imbalance between the large number of privacy attacks listed in Section 2.4
(i.e., memorization, membership inference, model extraction, and property inference) and
available reliable mitigation techniques. In some sense, this is a normal state of affairs: a
rapidly evolving technology gaining widespread adoption – even “hype” – which attracts
the attention of adversaries, who try to expose and exploit its weaknesses before the tech-
52
NIST AI 100-2e2023
January 2024
nology has matured enough for society to assess and manage it effectively. To be sure, not
all adversaries have malevolent intent. Some simply want to warn the public of potential
breakdowns that can cause harm and erode trust in the technology. Additionally, not all
attacks are as practical as they need to be to pose real threats to AI system deployments
of interest. Yet the race between developers and adversaries has begun, and both sides
are making great progress. This poses many diffcult questions for the AI community of
stakeholders, such as:
• What is the best way to mitigate the potential exploits of memorized data from Sec-
tion 3.3.1 as models grow and ingest larger amounts of data?
• What is the best way to prevent attackers from inferring membership in the training
set or other properties of the training data using the attacks listed in Sections 2.4.2
and 2.4.4?
• How can developers protect their ML models with the secret weights and associated
intellectual property from the emerging threats in the PredAI and GenAI spaces?
Especially, attacks that utilize the public API of the ML model to query and exploit
its secret weights or the side-channel leakage attacks from Section 2.4.3? The known
mechanisms of preventing large numbers of queries through the API are ineffective
in confgurations with anonymous or unauthenticated access to the model.
As answers to these questions become available, it is important for the community of stake-
holders to develop specifc guidelines to complement the NIST AI RMF [226] for use cases
where privacy is of utmost importance.
53
NIST AI 100-2e2023
January 2024
strong cryptographic algorithms publicly available and widely used. In another example,
in bio engineering, society has determined that the risks of uncontrolled genetic engineering
are too great to allow open access to the technology.
The open vs. closed model dilemma in AI is being actively debated in the community of
stakeholders and should be resolved before models become too powerful and make it moot.
The full characterization of the trade-offs between the different attributes of trust-
worthy AI is still an open research problem that is gaining increasing importance
with the adoption of AI technology in many areas of modern life.
54
NIST AI 100-2e2023
January 2024
In most cases, organizations will need to accept trade-offs between these properties and
decide which of them to prioritize depending on the AI system, the use case, and potentially
many other considerations about the economic, environmental, social, cultural, political,
and global implications of the AI technology [226].
55
NIST AI 100-2e2023
January 2024
for PredAI models exist in the literature [179]. The effects of quantization on GenAI mod-
els has been studies lees and organizations deploying such models should be careful to
continuously monitor their behavior.
56
NIST AI 100-2e2023
January 2024
References
[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov,
Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM Con-
ference on Computer and Communications Security, CCS ’16, pages 308–318, 2016.
https://arxiv.org/abs/1607.00133.
[2] Hojjat Aghakhani, Dongyu Meng, Yu-Xiang Wang, Christopher Kruegel, and Gio-
vanni Vigna. Bullseye polytope: A scalable clean-label poisoning attack with im-
proved transferability. In IEEE European Symposium on Security and Privacy, 2021,
Vienna, Austria, September 6-10, 2021, pages 159–178. IEEE, 2021.
[3] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine Stochastic Gradient De-
scent. In NeurIPS, 2018.
[4] Galen Andrew, Peter Kairouz, Sewoong Oh, Alina Oprea, H. Brendan McMahan,
and Vinith Suriyakumar. One-shot empirical privacy estimation for federated learn-
ing, 2023.
[5] Anthropic. Anthropic acceptable use policy, 2023.
[6] Anthropic. Model Card and Evaluations for Claude Models. https://www-fles.ant
hropic.com/production/images/Model-Card-Claude-2.pdf, July 2023. Anthropic.
[7] Giovanni Apruzzese, Hyrum S Anderson, Savino Dambra, David Freeman, Fabio
Pierazzi, and Kevin Roundy. “real attackers don’t compute gradients”: Bridging
the gap between adversarial ml research and practice. In 2023 IEEE Conference on
Secure and Trustworthy Machine Learning (SaTML), pages 339–364. IEEE, 2023.
[8] Arthur. Shield, 2023.
[9] Giuseppe Ateniese, Luigi V. Mancini, Angelo Spognardi, Antonio Villani,
Domenico Vitali, and Giovanni Felici. Hacking smart machines with smarter ones:
How to extract meaningful data from machine learning classifers. Int. J. Secur.
Netw., 10(3):137–150, September 2015.
[10] Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give
a false sense of security: Circumventing defenses to adversarial examples. In Jen-
nifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Con-
ference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden,
July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages
274–283. PMLR, 2018.
[11] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing
robust adversarial examples, 2018.
[12] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly
Shmatikov. How to backdoor federated learning. In Silvia Chiappa and Roberto
Calandra, editors, Proceedings of the Twenty Third International Conference on Ar-
tifcial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning
Research, pages 2938–2948. PMLR, 26–28 Aug 2020.
[13] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly
Shmatikov. How to backdoor federated learning. In AISTATS. PMLR, 2020.
57
NIST AI 100-2e2023
January 2024
[14] Marieke Bak, Vince Istvan Madai, Marie-Christine Fritzsche, Michaela Th.
Mayrhofer, and Stuart McLennan. You can’t have ai both ways: Balancing health
data privacy and access fairly. Frontiers in Genetics, 13, 2022. https://www.frontier
sin.org/articles/10.3389/fgene.2022.929453.
[15] Borja Balle, Giovanni Cherubin, and Jamie Hayes. Reconstructing training data with
informed adversaries. In NeurIPS 2021 Workshop on Privacy in Machine Learning
(PRIML), 2021.
[16] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal ma-
chine learning: A survey and taxonomy, 2017.
[17] Anthony M. Barrett, Dan Hendrycks, Jessica Newman, and Brandie Nonnecke. UC
Berkeley AI Risk-Management Standards Profle for General-Purpose AI Systems
(GPAIS) and Foundation Models. UC Berkeley Center for Long Term Cybersecurity,
2023. https://cltc.berkeley.edu/seeking-input-and-feedback-ai-risk-management-st
andards-profile-for-increasingly-multi-purpose-or-general-purpose-ai/.
[18] Lejla Batina, Shivam Bhasin, Dirmanto Jap, and Stjepan Picek. CSI NN: Reverse
engineering of neural network architectures through electromagnetic side channel.
In Proceedings of the 28th USENIX Conference on Security Symposium, SEC’19,
page 515–532, USA, 2019. USENIX Association.
[19] Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. A survey
on deep multimodal learning for computer vision: Advances, trends, applications,
and datasets. Vis. Comput., 38(8):2939–2970, August 2022.
[20] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev
McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from
transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
[21] Philipp Benz, Chaoning Zhang, Soomin Ham, Gyusang Karjauv, Adil Cho, and
In So Kweon. The triangular trade-off between accuracy, robustness, and fairness.
Workshop on Adversarial Machine Learning in Real-World Computer Vision Sys-
tems and Online Challenges (AML-CV) at CVPR, 2021.
[22] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver,
and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In
H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett,
editors, Advances in Neural Information Processing Systems 32, pages 5050–5060.
Curran Associates, Inc., 2019.
[23] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo.
Model Poisoning Attacks in Federated Learning. In NeurIPS SECML, 2018.
[24] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. An-
alyzing federated learning through an adversarial lens. In Kamalika Chaudhuri and
Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on
Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages
634–643. PMLR, 09–15 Jun 2019.
[25] Battista Biggio, Igino Corona, Giorgio Fumera, Giorgio Giacinto, and Fabio Roli.
Bagging classifers for fghting poisoning attacks in adversarial classifcation tasks.
58
NIST AI 100-2e2023
January 2024
59
NIST AI 100-2e2023
January 2024
Machinery.
[37] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,
et al. Language models are few-shot learners. Advances in neural information pro-
cessing systems, 33:1877–1901, 2020.
[38] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,
Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christo-
pher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and
Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165,
2020.
[39] Gon Buzaglo, Niv Haim, Gilad Yehudai, Gal Vardi, and Michal Irani. Reconstruct-
ing training data from multiclass neural networks, 2023.
[40] Xiaoyu Cao, Minghong Fang, Jia Liu, and Neil Zhenqiang Gong. FLTrust:
Byzantine-robust federated learning via trust bootstrapping. In NDSS, 2021.
[41] Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine un-
learning. In 2015 IEEE Symposium on Security and Privacy, pages 463–480, 2015.
[42] Nicholas Carlini. Poisoning the unlabeled dataset of Semi-Supervised learning.
In 30th USENIX Security Symposium (USENIX Security 21), pages 1577–1592.
USENIX Association, August 2021.
[43] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Flo-
rian Tramer. Membership inference attacks from frst principles. In 2022 IEEE
Symposium on Security and Privacy (S&P), pages 1519–1519, Los Alamitos, CA,
USA, May 2022. IEEE Computer Society.
[44] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Flo-
rian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data
from diffusion models, 2023.
[45] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian
Tramer, and Chiyuan Zhang. Quantifying memorization across neural language
models. https://arxiv.org/abs/2202.07646, 2022.
[46] Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka,
Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramer. `
Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149,
2023.
[47] Nicholas Carlini, Matthew Jagielski, and Ilya Mironov. Cryptanalytic extraction
of neural network models. In Daniele Micciancio and Thomas Ristenpart, editors,
Advances in Cryptology – CRYPTO 2020, pages 189–218, Cham, 2020. Springer
International Publishing.
[48] ´
Nicholas Carlini, Chang Liu, Ulfar Erlingsson, Jernej Kos, and Dawn Song. The
Secret Sharer: Evaluating and testing unintended memorization in neural networks.
60
NIST AI 100-2e2023
January 2024
61
NIST AI 100-2e2023
January 2024
[62] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Ed-
wards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks
on deep neural networks by activation clustering. https://arxiv.org/abs/1811.03728,
2018.
[63] Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. Attacking
visual language grounding with adversarial examples: A case study on neural image
captioning. https://arxiv.org/abs/1712.02051, 2017.
[64] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. DeepInspect: A black-
box trojan detection and mitigation framework for deep neural networks. In Proceed-
ings of the Twenty-Eighth International Joint Conference on Artifcial Intelligence,
IJCAI-19, pages 4658–4664. International Joint Conferences on Artifcial Intelli-
gence Organization, 7 2019.
[65] Jianbo Chen, Michael I. Jordan, and Martin J. Wainwright. HopSkipJumpAttack:
A query-effcient decision-based attack. In 2020 IEEE Symposium on Security and
Privacy, SP 2020, San Francisco, CA, USA, May 18-21, 2020, pages 1277–1294.
IEEE, 2020.
[66] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Ze-
roth order optimization based black-box attacks to deep neural networks without
training substitute models. In Proceedings of the 10th ACM Workshop on Artif-
cial Intelligence and Security, AISec ’17, page 15–26, New York, NY, USA, 2017.
Association for Computing Machinery.
[67] Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng Chau.
ShapeShifter: Robust Physical Adversarial Attack on Faster R-CNN Object Detec-
tor, page 52–68. Springer International Publishing, 2019.
[68] Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni
Shen, Zhonghai Wu, and Yang Zhang. Badnl: Backdoor attacks against nlp models
with semantic-preserving improvements. In Annual Computer Security Applications
Conference, ACSAC ’21, page 554–569, New York, NY, USA, 2021. Association for
Computing Machinery.
[69] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted
backdoor attacks on deep learning systems using data poisoning. arXiv preprint
arXiv:1712.05526, 2017.
[70] Heng-Tze Cheng and Romal Thoppilan. LaMDA: Towards Safe, Grounded, and
High-Quality Dialog Models for Everything. https://ai.googleblog.com/2022/01/la
mda-towards-safe-grounded-and-high.html, 2022. Google Brain.
[71] Minhao Cheng, Thong Le, Pin-Yu Chen, Huan Zhang, Jinfeng Yi, and Cho-Jui
Hsieh. Query-effcient hard-label black-box attack: An optimization-based ap-
proach. In 7th International Conference on Learning Representations, ICLR 2019,
New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[72] Minhao Cheng, Simranjit Singh, Patrick H. Chen, Pin-Yu Chen, Sijia Liu, and Cho-
Jui Hsieh. Sign-opt: A query-effcient hard-label adversarial attack. In International
Conference on Learning Representations, 2020.
62
NIST AI 100-2e2023
January 2024
[73] Alesia Chernikova and Alina Oprea. FENCE: Feasible evasion attacks on neural
networks in constrained environments. ACM Transactions on Privacy and Security
(TOPS) Journal, 2022.
[74] Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, and Nicolas Pa-
pernot. Label-only membership inference attacks. In Marina Meila and Tong Zhang,
editors, Proceedings of the 38th International Conference on Machine Learning, vol-
ume 139 of Proceedings of Machine Learning Research, pages 1964–1974. PMLR,
18–24 Jul 2021.
[75] Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. How to backdoor diffusion mod-
els? https://arxiv.org/abs/2212.05400, 2022.
[76] Antonio Emanuele Cina, ` Kathrin Grosse, Ambra Demontis, Sebastiano Vascon,
Werner Zellinger, Bernhard A. Moser, Alina Oprea, Battista Biggio, Marcello
Pelillo, and Fabio Roli. Wild patterns reloaded: A survey of machine learning secu-
rity against training data poisoning. ACM Computing Surveys, March 2023.
[77] Jack Clark and Raymond Perrault. 2022 AI index report. https://aiindex.stanford.e
du/wp-content/uploads/2022/03/2022-AI-Index-Report Master.pdf, 2022. Human
Centered AI, Stanford University.
[78] Joseph Clements, Yuzhe Yang, Ankur Sharma, Hongxin Hu, and Yingjie Lao. Ral-
lying adversarial techniques against deep learning for network security, 2019.
[79] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certifed adversarial robustness via
randomized smoothing. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learning, volume 97
of Proceedings of Machine Learning Research, pages 1310–1320. PMLR, 09–15
Jun 2019.
[80] Committee on National Security Systems. Committee on National Security Systems
(CNSS) Glossary. https://rmf.org/wp-content/uploads/2017/10/CNSSI-4009.pdf,
2015.
[81] Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti,
Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robust-
bench: a standardized adversarial robustness benchmark. In Thirty-ffth Conference
on Neural Information Processing Systems Datasets and Benchmarks Track (Round
2), 2021.
[82] Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. Ad-
versarial classifcation. In Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’04, page 99–108,
New York, NY, USA, 2004. Association for Computing Machinery.
[83] DeepMind. Building safer dialogue agents. https://www.deepmind.com/blog/buildi
ng-safer-dialogue-agents, 2022. Online.
[84] Luca Demetrio, Battista Biggio, Giovanni Lagorio, Fabio Roli, and Alessandro Ar-
mando. Functionality-preserving black-box optimization of adversarial windows
malware. IEEE Transactions on Information Forensics and Security, 16:3469–3478,
2021.
63
NIST AI 100-2e2023
January 2024
[85] Ambra Demontis, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio,
Alina Oprea, Cristina Nita-Rotaru, and Fabio Roli. Why do adversarial attacks trans-
fer? Explaining transferability of evasion and poisoning attacks. In 28th USENIX
Security Symposium (USENIX Security 19), pages 321–338. USENIX Association,
2019.
[86] Serguei Denissov, Hugh Brendan McMahan, J Keith Rush, Adam Smith, and
Abhradeep Guha Thakurta. Improved differential privacy for SGD via optimal pri-
vate linear operators on adaptive streams. In Alice H. Oh, Alekh Agarwal, Danielle
Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing
Systems, 2022.
[87] Ben Derico. Chatgpt bug leaked users’ conversation histories, 2023.
[88] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and
Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In
International Conference on Machine Learning, pages 1596–1606. PMLR, 2019.
[89] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In
Proceedings of the 22nd ACM Symposium on Principles of Database Systems, PODS
’03, pages 202–210. ACM, 2003.
[90] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi-
aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg
Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth
16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929,
2021.
[91] Sanghamitra Dutta, Dennis Wei, Hazar Yueksel, Pin-Yu Chen, Sijia Liu, and Kush R.
Varshney. Is there a trade-off between fairness and accuracy? A perspective using
mismatched hypothesis testing. In Proceedings of the 37th International Conference
on Machine Learning, ICML’20. JMLR.org, 2020.
[92] Cynthia Dwork. Differential privacy. In Automata, Languages and Programming,
33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Pro-
ceedings, Part II, pages 1–12, 2006.
[93] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise
to sensitivity in private data analysis. In Conference on Theory of Cryptography,
TCC ’06, pages 265–284, New York, NY, USA, 2006.
[94] Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. Exposed! A
survey of attacks on private data. Annual Review of Statistics and Its Application,
4:61–84, 2017.
[95] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan.
Robust traceability from trace amounts. In IEEE Symposium on Foundations of
Computer Science, FOCS ’15, 2015.
[96] Cynthia Dwork and Sergey Yekhanin. New effcient attacks on statistical disclosure
control mechanisms. In Annual International Cryptology Conference, pages 469–
480. Springer, 2008.
[97] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotfip: White-box ad-
64
NIST AI 100-2e2023
January 2024
65
NIST AI 100-2e2023
January 2024
66
NIST AI 100-2e2023
January 2024
67
NIST AI 100-2e2023
January 2024
[131] Xiaoling Hu, Xiao Lin, Michael Cogswell, Yi Yao, Susmit Jha, and Chao Chen.
Trigger hunting with a topological prior for trojan detection. In International Con-
ference on Learning Representations, 2022.
[132] Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. Are large pre-trained lan-
guage models leaking your personal information? arXiv preprint arXiv:2205.12628,
2022.
[133] W. Ronny Huang, Jonas Geiping, Liam Fowl, Gavin Taylor, and Tom Goldstein.
Metapoison: Practical general-purpose clean-label data poisoning. In H. Larochelle,
M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural In-
formation Processing Systems, volume 33, pages 12080–12091. Curran Associates,
Inc., 2020.
[134] Xijie Huang, Moustafa Alzantot, and Mani Srivastava. NeuronInspect: Detecting
backdoors in neural networks via output explanations, 2019.
[135] W. Nicholson Price II. Risks and remedies for artifcial intelligence in health care.
https://www.brookings.edu/research/risks-and-remedies-for-artifcial-intelligence-i
n-health-care/, 2019. Brookings Report.
[136] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adver-
sarial attacks with limited queries and information. In Jennifer G. Dy and An-
dreas Krause, editors, Proceedings of the 35th International Conference on Ma-
chine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15,
2018, volume 80 of Proceedings of Machine Learning Research, pages 2142–2151.
PMLR, 2018.
[137] Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-
box adversarial attacks with bandits and priors. In International Conference on
Learning Representations, 2019.
[138] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran,
and Aleksander Madry. Adversarial examples are not bugs, they are features. In
H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 32. Curran
Associates, Inc., 2019.
[139] Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approxi-
mate data deletion from machine learning models. In Arindam Banerjee and Kenji
Fukumizu, editors, The 24th International Conference on Artifcial Intelligence and
Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceed-
ings of Machine Learning Research, pages 2008–2016. PMLR, 2021.
[140] Shahin Jabbari, Han-Ching Ou, Himabindu Lakkaraju, and Milind Tambe. An em-
pirical study of the trade-offs between interpretability and fairness. In ICML Work-
shop on Human Interpretability in Machine Learning, International Conference on
Machine Learning (ICML), 2020.
[141] Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas
Papernot. High accuracy and high fdelity extraction of neural networks. In Pro-
ceedings of the 29th USENIX Conference on Security Symposium, SEC’20, USA,
68
NIST AI 100-2e2023
January 2024
69
NIST AI 100-2e2023
January 2024
David Evans, Josh Gardner, Zachary Garrett, Adri`a Gasc´on, Badih Ghazi, Phillip B.
Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo,
Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak,
Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède
Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Ozgür,¨
Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn
Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Flo-
rian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang,
Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated
learning, 2019.
[153] Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori
Hashimoto. Exploiting programmatic behavior of llms: Dual-use through standard
security attacks. arXiv preprint arXiv:2302.05733, 2023.
[154] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer.
Reluplex: An effcient SMT solver for verifying deep neural networks. In Rupak
Majumdar and Viktor Kunčak, editors, Computer Aided Verifcation, pages 97–117,
Cham, 2017. Springer International Publishing.
[155] Michael Kearns and Ming Li. Learning in the presence of malicious errors. In
Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing,
STOC ’88, page 267–280, New York, NY, USA, 1988. Association for Computing
Machinery.
[156] Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi
Salman, Andrew Ilyas, and Aleksander Madry. Rethinking backdoor attacks. In An-
dreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato,
and Jonathan Scarlett, editors, Proceedings of the 40th International Conference
on Machine Learning, volume 202 of Proceedings of Machine Learning Research,
pages 16216–16236. PMLR, 23–29 Jul 2023.
[157] Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin,
Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron
Ho, Elizabeth Barnes, and Paul Christiano. Evaluating Language-Model Agents on
Realistic Autonomous Tasks. https://evals.alignment.org/Evaluating LMAs Realist
ic Tasks.pdf, 2023.
[158] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom
Goldstein. A watermark for large language models, 2023.
[159] Marius Kloft and Pavel Laskov. Security analysis of online centroid anomaly detec-
tion. Journal of Machine Learning Research, 13(118):3681–3724, 2012.
[160] Pang Wei Koh and Percy Liang. Understanding black-box predictions via infu-
ence functions. In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 1885–1894. JMLR. org, 2017.
[161] Moshe Kravchik, Battista Biggio, and Asaf Shabtai. Poisoning attacks on cyber
attack detectors for industrial control systems. In Proceedings of the 36th Annual
ACM Symposium on Applied Computing, SAC ’21, page 116–125, New York, NY,
70
NIST AI 100-2e2023
January 2024
71
NIST AI 100-2e2023
January 2024
[176] Shaofeng Li, Minhui Xue, Benjamin Zi Hao Zhao, Haojin Zhu, and Xinpeng Zhang.
Invisible backdoor attacks on deep neural networks via steganography and regular-
ization. IEEE Transactions on Dependable and Secure Computing, 18:2088–2105,
2021.
[177] Shasha Li, Ajaya Neupane, Sujoy Paul, Chengyu Song, Srikanth V. Krishnamurthy,
Amit K. Roy-Chowdhury, and Ananthram Swami. Adversarial perturbations against
real-time video classifcation systems. CoRR, abs/1807.00458, 2018.
[178] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi-
hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al.
Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
[179] Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When effciency meets
robustness. ArXiv, abs/1904.08444, 2019.
[180] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending
against backdooring attacks on deep neural networks. In Michael Bailey, Sotiris
Ioannidis, Manolis Stamatogiannakis, and Thorsten Holz, editors, Research in At-
tacks, Intrusions, and Defenses - 21st International Symposium, RAID 2018, Pro-
ceedings, Lecture Notes in Computer Science, pages 273–294. Springer Verlag,
2018.
[181] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable ad-
versarial examples and black-box attacks. In International Conference on Learning
Representations, 2017.
[182] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu
Wang, Yan Zheng, and Yang Liu. Prompt injection attack against llm-integrated
applications. arXiv preprint arXiv:2306.05499, 2023.
[183] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida
Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering:
An empirical study. arXiv preprint arXiv:2305.13860, 2023.
[184] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xi-
angyu Zhang. ABS: Scanning neural networks for back-doors by artifcial brain
stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer
and Communications Security, CCS ’19, page 1265–1282, New York, NY, USA,
2019. Association for Computing Machinery.
[185] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang,
and Xiangyu Zhang. Trojaning attack on neural networks. In NDSS. The Internet
Society, 2018.
[186] Yunfei Liu, Xingjun Ma, James Bailey, and Feng Lu. Refection backdoor: A natural
backdoor attack on deep neural networks. In Andrea Vedaldi, Horst Bischof, Thomas
Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 182–
199, Cham, 2020. Springer International Publishing.
[187] Martin Bertran Lopez, Shuai Tang, Michael Kearns, Jamie Morgenstern, Aaron
Roth, and Zhiwei Steven Wu. Scalable membership inference attacks via quantile
regression. In NeurIPS 2023, 2023.
72
NIST AI 100-2e2023
January 2024
[188] Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the
Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data
Mining, KDD ’05, page 641–647, New York, NY, USA, 2005. Association for Com-
puting Machinery.
[189] Jiajun Lu, Hussein Sibai, Evan Fabry, and David Forsyth. No need to worry about
adversarial examples in object detection in autonomous vehicles, 2017.
[190] Yiwei Lu, Gautam Kamath, and Yaoliang Yu. Indiscriminate data poisoning attacks
on neural networks. https://arxiv.org/abs/2204.09092, 2022.
[191] Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santi-
ago Zanella-Béguelin. Analyzing leakage of personally identifable information in
language models. In 2023 IEEE Symposium on Security and Privacy (SP), pages
346–363. IEEE Computer Society, 2023.
[192] Yuzhe Ma, Xiaojin Zhu, and Justin Hsu. Data poisoning against differentially-private
learners: Attacks and defenses. In In Proceedings of the 28th International Joint
Conference on Artifcial Intelligence (IJCAI), 2019.
[193] Pooria Madani and Natalija Vlajic. Robustness of deep autoencoder in intrusion
detection under adversarial contamination. In HoTSoS ’18: Proceedings of the 5th
Annual Symposium and Bootcamp on Hot Topics in the Science of Security, pages
1–8, 04 2018.
[194] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and
Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th
International Conference on Learning Representations, ICLR 2018, Vancouver, BC,
Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,
2018.
[195] Saeed Mahloujifar, Esha Ghosh, and Melissa Chase. Property inference from poi-
soning. In 2022 IEEE Symposium on Security and Privacy (S&P), pages 1120–1137,
2022.
[196] James Manyika and Sissie Hsiao. An overview of Bard: an early experiment with
generative AI. https://ai.google/static/documents/google-about-bard.pdf, February
2023. Google.
[197] Shiona McCallum. Chatgpt banned in italy over privacy concerns, 2023.
[198] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In
IEEE Symposium on Foundations of Computer Science, FOCS ’07, pages 94–103,
Las Vegas, NV, USA, 2007.
[199] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum An-
derson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box
llms automatically. arXiv preprint arXiv:2312.02119, 2023.
[200] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Ex-
ploiting unintended feature leakage in collaborative learning. In 2019 IEEE Sympo-
sium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019,
pages 691–706. IEEE, 2019.
¨ What does GPT-3 “know” about me? https://www.technologyre
[201] Melissa Heikkila.
73
NIST AI 100-2e2023
January 2024
74
NIST AI 100-2e2023
January 2024
75
NIST AI 100-2e2023
January 2024
76
NIST AI 100-2e2023
January 2024
– ECCV 2020 Workshops, pages 55–70, Cham, 2020. Springer International Pub-
lishing.
[239] Fabio Pierazzi, Feargus Pendlebury, Jacopo Cortellazzi, and Lorenzo Cavallaro. In-
triguing properties of adversarial ML attacks in the problem space. In 2020 IEEE
Symposium on Security and Privacy (S&P), pages 1308–1325. IEEE Computer So-
ciety, 2020.
[240] Krishna Pillutla, Galen Andrew, Peter Kairouz, H. Brendan McMahan, Alina Oprea,
and Sewoong Oh. Unleashing the power of randomization in auditing differentially
private ml. In Advances in Neural Information Processing Systems, 2023.
[241] Jonathan Protzenko, Bryan Parno, Aymeric Fromherz, Chris Hawblitzel, Marina
Polubelova, Karthikeyan Bhargavan, Benjamin Beurdouche, Joonwon Choi, An-
toine Delignat-Lavaud, Cédric Fournet, Natalia Kulatova, Tahina Ramananandro,
Aseem Rastogi, Nikhil Swamy, Christoph Wintersteiger, and Santiago Zanella-
Beguelin. EverCrypt: A fast, verifed, cross-platform cryptographic provider. In
Proceedings of the IEEE Symposium on Security and Privacy (Oakland), May 2020.
[242] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and
Peter Henderson. Fine-tuning aligned language models compromises safety, even
when users do not intend to! https://arxiv.org/abs/2310.03693, 2023.
[243] Gauthama Raman M. R., Chuadhry Mujeeb Ahmed, and Aditya Mathur. Machine
learning for intrusion detection in industrial control systems: Challenges and lessons
from experimental evaluation. Cybersecurity, 4(27), 2021.
[244] Aida Rahmattalabi, Shahin Jabbari, Himabindu Lakkaraju, Phebe Vayanos, Max
Izenberg, Ryan Brown, Eric Rice, and Milind Tambe. Fair infuence maximization:
A welfare optimization approach. In Proceedings of the AAAI Conference on Artif-
cial Intelligence 35th, 2021.
[245] Adnan Siraj Rakin, Md Hafzul Islam Chowdhuryy, Fan Yao, and Deliang Fan.
DeepSteal: Advanced model extractions leveraging effcient weight stealing in mem-
ories. In 2022 IEEE Symposium on Security and Privacy (S&P), pages 1157–1174,
2022.
[246] Dhanesh Ramachandram and Graham W. Taylor. Deep multimodal learning: A
survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–
108, 2017.
[247] Robust Intelligence. AI Firewall, 2023.
[248] Elan Rosenfeld, Ezra Winston, Pradeep Ravikumar, and Zico Kolter. Certifed ro-
bustness to label-fipping attacks via randomized smoothing. In International Con-
ference on Machine Learning, pages 8230–8241. PMLR, 2020.
[249] Benjamin IP Rubinstein, Blaine Nelson, Ling Huang, Anthony D Joseph, Shing-
hon Lau, Satish Rao, Nina Taft, and J Doug Tygar. Antidote: understanding and
defending against poisoning of anomaly detectors. In Proceedings of the 9th ACM
SIGCOMM conference on Internet measurement, pages 1–14, 2009.
[250] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Yann Ollivier, and Hervé
Jégou. White-box vs black-box: Bayes optimal strategies for membership inference.
77
NIST AI 100-2e2023
January 2024
78
NIST AI 100-2e2023
January 2024
79
NIST AI 100-2e2023
January 2024
[274] Gagandeep Singh, Timon Gehr, Markus Püschel, and Martin Vechev. An abstract
domain for certifying neural networks. Proc. ACM Program. Lang., 3, January 2019.
[275] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini,
Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying
semi-supervised learning with consistency and confdence. In Proceedings of the
34th International Conference on Neural Information Processing Systems, NIPS’20,
Red Hook, NY, USA, 2020. Curran Associates Inc.
[276] Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael
Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna
Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv
Verma, Gokhan Tur, and Prem Natarajan. AlexaTM 20B: Few-shot learning using a
large-scale multilingual seq2seq model. https://www.amazon.science/publications/
alexatm-20b-few-shot-learning-using-a-large-scale-multilingual-seq2seq-model,
2022. Amazon.
[277] Dawn Song, Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rah-
` Atul Prakash, and Tadayoshi Kohno. Physical adversarial ex-
mati, Florian Tramer,
amples for object detectors. In 12th USENIX Workshop on Offensive Technologies
(WOOT 18), Baltimore, MD, August 2018. USENIX Association.
[278] Shuang Song and David Marn. Introducing a new privacy testing library in Tensor-
Flow, 2020.
[279] N. Srndic and P. Laskov. Practical evasion of a learning-based classifer: A case
study. In Proc. IEEE Security and Privacy Symposium, 2014.
[280] Jacob Steinhardt, Pang Wei W Koh, and Percy S Liang. Certifed defenses for data
poisoning attacks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Process-
ing Systems, volume 30. Curran Associates, Inc., 2017.
[281] Thomas Steinke, Milad Nasr, and Matthew Jagielski. Privacy auditing with one (1)
training run. In Advances in Neural Information Processing Systems, 2023.
[282] Octavian Suciu, Scott E Coull, and Jeffrey Johns. Exploring adversarial examples
in malware detection. In 2019 IEEE Security and Privacy Workshops (SPW), pages
8–14. IEEE, 2019.
[283] Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Du-
mitras. When does machine learning FAIL? generalized transferability for evasion
and poisoning attacks. In 27th USENIX Security Symposium (USENIX Security 18),
pages 1299–1316, 2018.
[284] Jingwei Sun, Ang Li, Louis DiValentin, Amin Hassanzadeh, Yiran Chen, and Hai
Li. FL-WBC: Enhancing robustness against model poisoning attacks in federated
learning from a client perspective. In NeurIPS, 2021.
[285] Ziteng Sun, Peter Kairouz, Ananda Theertha Suresh, and H Brendan McMahan. Can
you really backdoor federated learning? arXiv:1911.07963, 2019.
[286] Anshuman Suri and David Evans. Formalizing and estimating distribution inference
risks. Proceedings on Privacy Enhancing Technologies, 2022.
80
NIST AI 100-2e2023
January 2024
[287] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan,
Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Inter-
national Conference on Learning Representations, 2014.
[288] Rahim Taheri, Reza Javidan, Mohammad Shojafar, Zahra Pooranian, Ali Miri, and
Mauro Conti. On defending against label fipping attacks on malware detection
systems. CoRR, abs/1908.04473, 2019.
[289] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste
Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth,
Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou,
Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap,
Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham,
Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu,
Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Ka-
reem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain
Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaı̈s White, Anders An-
dreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez,
Misha Khalman, Jakub Sygnowski, Alexandre Frechette, Charlotte Smith, Laura
Culp, Lev Proleev, Yi Luan, Xi Chen, James Lottes, Nathan Schucher, Federico
Lebron, Alban Rrustemi, Natalie Clay, Phil Crone, Tomas Kocisky, Jeffrey Zhao,
Bartek Perz, Dian Yu, Heidi Howard, Adam Bloniarz, Jack W. Rae, Han Lu, Lau-
rent Sifre, Marcello Maggioni, Fred Alcober, Dan Garrette, Megan Barnes, Shan-
tanu Thakoor, Jacob Austin, Gabriel Barth-Maron, William Wong, Rishabh Joshi,
Rahma Chaabouni, Deeni Fatiha, Arun Ahuja, Ruibo Liu, Yunxuan Li, Sarah Co-
gan, Jeremy Chen, Chao Jia, Chenjie Gu, Qiao Zhang, Jordan Grimstad, Ale Jakse
Hartman, Martin Chadwick, Gaurav Singh Tomar, Xavier Garcia, Evan Senter,
Emanuel Taropa, Thanumalayan Sankaranarayana Pillai, Jacob Devlin, Michael
Laskin, Diego de Las Casas, Dasha Valter, Connie Tao, Lorenzo Blanco, Adrià Puig-
domènech Badia, David Reitter, Mianna Chen, Jenny Brennan, Clara Rivera, Sergey
Brin, Shariq Iqbal, Gabriela Surita, Jane Labanowski, Abhi Rao, Stephanie Win-
kler, Emilio Parisotto, Yiming Gu, Kate Olszewska, Yujing Zhang, Ravi Addanki,
Antoine Miech, Annie Louis, Laurent El Shafey, Denis Teplyashin, Geoff Brown,
Elliot Catt, Nithya Attaluri, Jan Balaguer, Jackie Xiang, Pidong Wang, Zoe Ash-
wood, Anton Briukhov, Albert Webson, Sanjay Ganapathy, Smit Sanghavi, Ajay
Kannan, Ming-Wei Chang, Axel Stjerngren, Josip Djolonga, Yuting Sun, Ankur
Bapna, Matthew Aitchison, Pedram Pejman, Henryk Michalewski, Tianhe Yu,
Cindy Wang, Juliette Love, Junwhan Ahn, Dawn Bloxwich, Kehang Han, Peter
Humphreys, Thibault Sellam, James Bradbury, Varun Godbole, Sina Samangooei,
Bogdan Damoc, Alex Kaskasoli, Sébastien M. R. Arnold, Vijay Vasudevan, Shub-
ham Agrawal, Jason Riesa, Dmitry Lepikhin, Richard Tanburn, Srivatsan Srini-
vasan, Hyeontaek Lim, Sarah Hodkinson, Pranav Shyam, Johan Ferret, Steven Hand,
Ankush Garg, Tom Le Paine, Jian Li, Yujia Li, Minh Giang, Alexander Neitz, Za-
heer Abbas, Sarah York, Machel Reid, Elizabeth Cole, Aakanksha Chowdhery, Di-
81
NIST AI 100-2e2023
January 2024
82
NIST AI 100-2e2023
January 2024
Max Bileschi, Piyush Patil, Ankesh Anand, Roman Ring, Katerina Tsihlas, Arpi
Vezer, Marco Selvi, Toby Shevlane, Mikel Rodriguez, Tom Kwiatkowski, Samira
Daruki, Keran Rong, Allan Dafoe, Nicholas FitzGerald, Keren Gu-Lemberg, Mina
Khan, Lisa Anne Hendricks, Marie Pellat, Vladimir Feinberg, James Cobon-Kerr,
Tara Sainath, Maribeth Rauh, Sayed Hadi Hashemi, Richard Ives, Yana Hasson,
YaGuang Li, Eric Noland, Yuan Cao, Nathan Byrd, Le Hou, Qingze Wang, Thibault
Sottiaux, Michela Paganini, Jean-Baptiste Lespiau, Alexandre Moufarek, Samer
Hassan, Kaushik Shivakumar, Joost van Amersfoort, Amol Mandhane, Pratik Joshi,
Anirudh Goyal, Matthew Tung, Andrew Brock, Hannah Sheahan, Vedant Misra,
Cheng Li, Nemanja Rakićević, Mostafa Dehghani, Fangyu Liu, Sid Mittal, Jun-
hyuk Oh, Seb Noury, Eren Sezener, Fantine Huot, Matthew Lamm, Nicola De Cao,
Charlie Chen, Gamaleldin Elsayed, Ed Chi, Mahdis Mahdieh, Ian Tenney, Nan
Hua, Ivan Petrychenko, Patrick Kane, Dylan Scandinaro, Rishub Jain, Jonathan
Uesato, Romina Datta, Adam Sadovsky, Oskar Bunyan, Dominik Rabiej, Shimu
Wu, John Zhang, Gautam Vasudevan, Edouard Leurent, Mahmoud Alnahlawi, Ionut
Georgescu, Nan Wei, Ivy Zheng, Betty Chan, Pam G Rabinovitch, Piotr Stanczyk,
Ye Zhang, David Steiner, Subhajit Naskar, Michael Azzam, Matthew Johnson,
Adam Paszke, Chung-Cheng Chiu, Jaume Sanchez Elias, Afroz Mohiuddin, Faizan
Muhammad, Jin Miao, Andrew Lee, Nino Vieillard, Sahitya Potluri, Jane Park,
Elnaz Davoodi, Jiageng Zhang, Jeff Stanway, Drew Garmon, Abhijit Karmarkar,
Zhe Dong, Jong Lee, Aviral Kumar, Luowei Zhou, Jonathan Evens, William Isaac,
Zhe Chen, Johnson Jia, Anselm Levskaya, Zhenkai Zhu, Chris Gorgolewski, Pe-
ter Grabowski, Yu Mao, Alberto Magni, Kaisheng Yao, Javier Snaider, Norman
Casagrande, Paul Suganthan, Evan Palmer, Geoffrey Irving, Edward Loper, Man-
aal Faruqui, Isha Arkatkar, Nanxin Chen, Izhak Shafran, Michael Fink, Alfonso
Casta˜no, Irene Giannoumis, Wooyeol Kim, Mikołaj Rybi´nski, Ashwin Sreevatsa,
Jennifer Prendki, David Soergel, Adrian Goedeckemeyer, Willi Gierke, Mohsen
Jafari, Meenu Gaba, Jeremy Wiesner, Diana Gage Wright, Yawen Wei, Harsha
Vashisht, Yana Kulizhskaya, Jay Hoover, Maigo Le, Lu Li, Chimezie Iwuanyanwu,
Lu Liu, Kevin Ramirez, Andrey Khorlin, Albert Cui, Tian LIN, Marin Georgiev,
Marcus Wu, Ricardo Aguilar, Keith Pallo, Abhishek Chakladar, Alena Repina,
Xihui Wu, Tom van der Weide, Priya Ponnapalli, Caroline Kaplan, Jiri Simsa,
Shuangfeng Li, Olivier Dousse, Fan Yang, Jeff Piper, Nathan Ie, Minnie Lui, Rama
Pasumarthi, Nathan Lintz, Anitha Vijayakumar, Lam Nguyen Thiet, Daniel Andor,
Pedro Valenzuela, Cosmin Paduraru, Daiyi Peng, Katherine Lee, Shuyuan Zhang,
Somer Greene, Duc Dung Nguyen, Paula Kurylowicz, Sarmishta Velury, Sebastian
Krause, Cassidy Hardin, Lucas Dixon, Lili Janzer, Kiam Choo, Ziqiang Feng, Biao
Zhang, Achintya Singhal, Tejasi Latkar, Mingyang Zhang, Quoc Le, Elena Allica
Abellan, Dayou Du, Dan McKinnon, Natasha Antropova, Tolga Bolukbasi, Or-
gad Keller, David Reid, Daniel Finchelstein, Maria Abi Raad, Remi Crocker, Pe-
ter Hawkins, Robert Dadashi, Colin Gaffney, Sid Lall, Ken Franko, Egor Filonov,
Anna Bulanova, Rémi Leblond, Vikas Yadav, Shirley Chung, Harry Askham, Luis C.
83
NIST AI 100-2e2023
January 2024
Cobo, Kelvin Xu, Felix Fischer, Jun Xu, Christina Sorokin, Chris Alberti, Chu-
Cheng Lin, Colin Evans, Hao Zhou, Alek Dimitriev, Hannah Forbes, Dylan Ba-
narse, Zora Tung, Jeremiah Liu, Mark Omernick, Colton Bishop, Chintu Kumar,
Rachel Sterneck, Ryan Foley, Rohan Jain, Swaroop Mishra, Jiawei Xia, Taylor
Bos, Geoffrey Cideron, Ehsan Amid, Francesco Piccinno, Xingyu Wang, Praseem
Banzal, Petru Gurita, Hila Noga, Premal Shah, Daniel J. Mankowitz, Alex Polo-
zov, Nate Kushman, Victoria Krakovna, Sasha Brown, MohammadHossein Bateni,
Dennis Duan, Vlad Firoiu, Meghana Thotakuri, Tom Natan, Anhad Mohananey,
Matthieu Geist, Sidharth Mudgal, Sertan Girgin, Hui Li, Jiayu Ye, Ofr Roval, Reiko
Tojo, Michael Kwong, James Lee-Thorp, Christopher Yew, Quan Yuan, Sumit Bagri,
Danila Sinopalnikov, Sabela Ramos, John Mellor, Abhishek Sharma, Aliaksei Sev-
eryn, Jonathan Lai, Kathy Wu, Heng-Tze Cheng, David Miller, Nicolas Sonnerat,
Denis Vnukov, Rory Greig, Jennifer Beattie, Emily Caveness, Libin Bai, Julian
Eisenschlos, Alex Korchemniy, Tomy Tsai, Mimi Jasarevic, Weize Kong, Phuong
Dao, Zeyu Zheng, Frederick Liu, Fan Yang, Rui Zhu, Mark Geller, Tian Huey
Teh, Jason Sanmiya, Evgeny Gladchenko, Nejc Trdin, Andrei Sozanschi, Daniel
Toyama, Evan Rosen, Sasan Tavakkol, Linting Xue, Chen Elkind, Oliver Woodman,
John Carpenter, George Papamakarios, Rupert Kemp, Sushant Kafe, Tanya Grunina,
Rishika Sinha, Alice Talbert, Abhimanyu Goyal, Diane Wu, Denese Owusu-Afriyie,
Cosmo Du, Chloe Thornton, Jordi Pont-Tuset, Pradyumna Narayana, Jing Li, Sabaer
Fatehi, John Wieting, Omar Ajmeri, Benigno Uria, Tao Zhu, Yeongil Ko, Laura
Knight, Am´elie H´eliou, Ning Niu, Shane Gu, Chenxi Pang, Dustin Tran, Yeqing Li,
Nir Levine, Ariel Stolovich, Norbert Kalb, Rebeca Santamaria-Fernandez, Sonam
Goenka, Wenny Yustalim, Robin Strudel, Ali Elqursh, Balaji Lakshminarayanan,
Charlie Deck, Shyam Upadhyay, Hyo Lee, Mike Dusenberry, Zonglin Li, Xuezhi
Wang, Kyle Levin, Raphael Hoffmann, Dan Holtmann-Rice, Olivier Bachem, Sum-
mer Yue, Sho Arora, Eric Malmi, Daniil Mirylenka, Qijun Tan, Christy Koh, So-
heil Hassas Yeganeh, Siim Põder, Steven Zheng, Francesco Pongetti, Mukarram
Tariq, Yanhua Sun, Lucian Ionita, Mojtaba Seyedhosseini, Pouya Tafti, Ragha
Kotikalapudi, Zhiyu Liu, Anmol Gulati, Jasmine Liu, Xinyu Ye, Bart Chrzaszcz,
Lily Wang, Nikhil Sethi, Tianrun Li, Ben Brown, Shreya Singh, Wei Fan, Aaron
Parisi, Joe Stanton, Chenkai Kuang, Vinod Koverkathu, Christopher A. Choquette-
Choo, Yunjie Li, TJ Lu, Abe Ittycheriah, Prakash Shroff, Pei Sun, Mani Varadarajan,
Sanaz Bahargam, Rob Willoughby, David Gaddy, Ishita Dasgupta, Guillaume Des-
jardins, Marco Cornero, Brona Robenek, Bhavishya Mittal, Ben Albrecht, Ashish
Shenoy, Fedor Moiseev, Henrik Jacobsson, Alireza Ghaffarkhah, Morgane Rivière,
Alanna Walton, Clément Crepy, Alicia Parrish, Yuan Liu, Zongwei Zhou, Clement
Farabet, Carey Radebaugh, Praveen Srinivasan, Claudia van der Salm, Andreas Fid-
jeland, Salvatore Scellato, Eri Latorre-Chimoto, Hanna Klimczak-Plucińska, David
Bridson, Dario de Cesare, Tom Hudson, Piermaria Mendolicchio, Lexi Walker,
Alex Morris, Ivo Penchev, Matthew Mauger, Alexey Guseynov, Alison Reid, Seth
Odoom, Lucia Loher, Victor Cotruta, Madhavi Yenugula, Dominik Grewe, Anas-
84
NIST AI 100-2e2023
January 2024
tasia Petrushkina, Tom Duerig, Antonio Sanchez, Steve Yadlowsky, Amy Shen,
Amir Globerson, Adam Kurzrok, Lynette Webb, Sahil Dua, Dong Li, Preethi La-
hoti, Surya Bhupatiraju, Dan Hurt, Haroon Qureshi, Ananth Agarwal, Tomer Shani,
Matan Eyal, Anuj Khare, Shreyas Rammohan Belle, Lei Wang, Chetan Tekur, Mi-
hir Sanjay Kale, Jinliang Wei, Ruoxin Sang, Brennan Saeta, Tyler Liechty, Yi Sun,
Yao Zhao, Stephan Lee, Pandu Nayak, Doug Fritz, Manish Reddy Vuyyuru, John
Aslanides, Nidhi Vyas, Martin Wicke, Xiao Ma, Taylan Bilal, Evgenii Eltyshev,
Daniel Balle, Nina Martin, Hardie Cate, James Manyika, Keyvan Amiri, Yelin Kim,
Xi Xiong, Kai Kang, Florian Luisier, Nilesh Tripuraneni, David Madras, Mandy
Guo, Austin Waters, Oliver Wang, Joshua Ainslie, Jason Baldridge, Han Zhang,
Garima Pruthi, Jakob Bauer, Feng Yang, Riham Mansour, Jason Gelman, Yang
Xu, George Polovets, Ji Liu, Honglong Cai, Warren Chen, XiangHai Sheng, Emily
Xue, Sherjil Ozair, Adams Yu, Christof Angermueller, Xiaowei Li, Weiren Wang,
Julia Wiesinger, Emmanouil Koukoumidis, Yuan Tian, Anand Iyer, Madhu Guru-
murthy, Mark Goldenson, Parashar Shah, MK Blake, Hongkun Yu, Anthony Ur-
banowicz, Jennimaria Palomaki, Chrisantha Fernando, Kevin Brooks, Ken Durden,
Harsh Mehta, Nikola Momchev, Elahe Rahimtoroghi, Maria Georgaki, Amit Raul,
Sebastian Ruder, Morgan Redshaw, Jinhyuk Lee, Komal Jalan, Dinghua Li, Gin-
ger Perng, Blake Hechtman, Parker Schuh, Milad Nasr, Mia Chen, Kieran Milan,
Vladimir Mikulik, Trevor Strohman, Juliana Franco, Tim Green, Demis Hassabis,
Koray Kavukcuoglu, Jeffrey Dean, and Oriol Vinyals. Gemini: A family of highly
capable multimodal models. https://arxiv.org/abs/2312.11805, 2023.
[290] The White House. Executive Order on the Safe, Secure, and Trustworthy Develop-
ment and Use of Artifcial Intelligence. https://www.whitehouse.gov/briefing-room
/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustw
orthy-development-and-use-of-artificial-intelligence/, October 2023. The White
House.
[291] Anvith Thudi, Ilia Shumailov, Franziska Boenisch, and Nicolas Papernot. Bounding
membership inference. https://arxiv.org/abs/2202.12232, 2022.
[292] Lionel Nganyewou Tidjon and Foutse Khomh. Threat assessment in machine learn-
ing based systems. arXiv preprint arXiv:2207.00091, 2022.
[293] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne
Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample.
Llama: Open and effcient foundation language models, 2023.
[294] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas-
mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos-
ale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucu-
rull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia
Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui
Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann,
Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya
85
NIST AI 100-2e2023
January 2024
Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov,
Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein,
Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ran-
jan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams,
Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan,
Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey
Edunov, and Thomas Scialom. Llama 2: Open foundation and fne-tuned chat mod-
els, 2023.
[295] Florian Tramer. Detecting adversarial examples is (Nearly) as hard as classifying
them. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang
Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference
on Machine Learning, volume 162 of Proceedings of Machine Learning Research,
pages 21692–21702. PMLR, 17–23 Jul 2022.
[296] Florian Tramer, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, and Joern-
Henrik Jacobsen. Fundamental tradeoffs between invariance and sensitivity to adver-
sarial perturbations. In Hal Daumé III and Aarti Singh, editors, Proceedings of the
37th International Conference on Machine Learning, volume 119 of Proceedings of
Machine Learning Research, pages 9561–9571. PMLR, 13–18 Jul 2020.
[297] Florian Tramer,` Nicholas Carlini, Wieland Brendel, and Aleksander Ma̧dry. On
adaptive attacks to adversarial example defenses. In Proceedings of the 34th In-
ternational Conference on Neural Information Processing Systems, NIPS’20, Red
Hook, NY, USA, 2020. Curran Associates Inc.
[298] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart.
Stealing machine learning models via prediction APIs. In USENIX Security, 2016.
[299] Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick Mc-
Daniel. The space of transferable adversarial examples. https://arxiv.org/abs/1704.0
3453, 2017.
[300] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor
attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and
R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31.
Curran Associates, Inc., 2018.
[301] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Alek-
sander Madry. Robustness may be at odds with accuracy. In International Confer-
ence on Learning Representations, 2019.
[302] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Clean-label backdoor
attacks. In ICLR, 2019.
[303] Apostol Vassilev, Honglan Jin, and Munawar Hasan. Meta learning with language
models: Challenges and opportunities in the classifcation of imbalanced text, 2023.
[304] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Advances in neural information processing systems, pages 5998–6008, 2017.
[305] Sridhar Venkatesan, Harshvardhan Sikka, Rauf Izmailov, Ritu Chadha, Alina Oprea,
86
NIST AI 100-2e2023
January 2024
and Michael J. De Lucia. Poisoning attacks and data sanitization mitigations for ma-
chine learning models in network intrusion detection systems. In MILCOM, pages
874–879. IEEE, 2021.
[306] Brandon Vigliarolo. GPT-3 ’prompt injection’ attack causes bad bot manners. https:
//www.theregister.com/2022/09/19/in brief security/, 2022. The Register, Online.
[307] Vincent, James. Google and Microsoft’s chatbots are already citing one another in a
misinformation shitshow, 2023.
[308] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Uni-
versal adversarial triggers for attacking and analyzing NLP. arXiv preprint
arXiv:1908.07125, 2019.
[309] Eric Wallace, Tony Z. Zhao, Shi Feng, and Sameer Singh. Concealed data poisoning
attacks on NLP models. In NAACL, 2021.
[310] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao
Zheng, and Ben Y. Zhao. Neural Cleanse: Identifying and Mitigating Backdoor
Attacks in Neural Networks. In 2019 IEEE Symposium on Security and Privacy
(SP), pages 707–723, San Francisco, CA, USA, May 2019. IEEE.
[311] Haotao Wang, Tianlong Chen, Shupeng Gui, Ting-Kuei Hu, Ji Liu, and Zhangyang
Wang. Once-for-All Adversarial Training: In-Situ Tradeoff between Robustness and
Accuracy for Free. In Proceedings of the 34th Conference on Neural Information
Processing Systems (NeurIPS 2020), Vancouver, Canada, 2020.
[312] Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh
Agarwal, Jy-yong Sohn, Kangwook Lee, and Dimitris Papailiopoulos. Attack of the
Tails: Yes, You Really Can Backdoor Federated Learning. In NeurIPS, 2020.
[313] Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Formal
security analysis of neural networks using symbolic intervals. In 27th USENIX Se-
curity Symposium (USENIX Security 18), pages 1599–1614, Baltimore, MD, August
2018. USENIX Association.
[314] Wenxiao Wang, Alexander Levine, and Soheil Feizi. Improved certifed defenses
against data poisoning with (deterministic) fnite aggregation. In Kamalika Chaud-
huri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato,
editors, International Conference on Machine Learning, ICML 2022, 17-23 July
2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning
Research, pages 22769–22783. PMLR, 2022.
[315] Xiaosen Wang and Kun He. Enhancing the transferability of adversarial attacks
through variance tuning. In IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 1924–1933. Computer
Vision Foundation / IEEE, 2021.
[316] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm
safety training fail? arXiv preprint arXiv:2307.02483, 2023.
[317] Xingxing Wei, Jun Zhu, Sha Yuan, and Hang Su. Sparse adversarial perturbations
for videos. In Proceedings of the Thirty-Third AAAI Conference on Artifcial In-
telligence and Thirty-First Innovative Applications of Artifcial Intelligence Confer-
87
NIST AI 100-2e2023
January 2024
88
NIST AI 100-2e2023
January 2024
[330] Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun
Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned
language models, 2023.
[331] Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y. Zhao. Latent backdoor attacks
on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on
Computer and Communications Security, CCS ’19, page 2041–2055, New York,
NY, USA, 2019. Association for Computing Machinery.
[332] Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and
Reza Shokri. Enhanced membership inference attacks against machine learning
models. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and
Communications Security, CCS ’22, page 3093–3106, New York, NY, USA, 2022.
Association for Computing Machinery.
[333] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk
in machine learning: Analyzing the connection to overftting. In IEEE Computer
Security Foundations Symposium, CSF ’18, pages 268–282, 2018. https://arxiv.org/
abs/1709.01604.
[334] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-
Robust Distributed Learning: Towards Optimal Statistical Rates. In ICML, 2018.
[335] Youngjoon Yu, Hong Joo Lee, Byeong Cheon Kim, Jung Uk Kim, and Yong Man
Ro. Investigating vulnerability to adversarial examples on multimodal data fusion in
deep learning. https://arxiv.org/abs/2005.10987, 2020. Online.
[336] Andrew Yuan, Alina Oprea, and Cheng Tan. Dropout attacks. In IEEE Symposium
on Security and Privacy (S&P), 2024.
[337] Santiago Zanella-B´eguelin, Lukas Wutschitz, Shruti Tople, Victor R¨uhle, Andrew
Paverd, Olga Ohrimenko, Boris Köpf, and Marc Brockschmidt. Analyzing informa-
tion leakage of updates to natural language models. In Proceedings of the 2020 ACM
SIGSAC Conference on Computer and Communications Security, page 363–375,
New York, NY, USA, 2020. Association for Computing Machinery.
[338] Santiago Zanella-Beguelin, Lukas Wutschitz, Shruti Tople, Ahmed Salem, Victor
Rühle, Andrew Paverd, Mohammad Naseri, Boris Köpf, and Daniel Jones. Bayesian
estimation of differential privacy. In Andreas Krause, Emma Brunskill, Kyunghyun
Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings
of the 40th International Conference on Machine Learning, volume 202 of Proceed-
ings of Machine Learning Research, pages 40624–40636. PMLR, 23–29 Jul 2023.
[339] Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali
Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models, 2021.
[340] Yi Zeng, Si Chen, Won Park, Zhuoqing Mao, Ming Jin, and Ruoxi Jia. Adversarial
unlearning of backdoors via implicit hypergradient. In International Conference on
Learning Representations, 2022.
[341] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.
Understanding deep learning (still) requires rethinking generalization. Commun.
ACM, 64(3):107–115, feb 2021.
89
NIST AI 100-2e2023
January 2024
[342] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and
Michael Jordan. Theoretically principled trade-off between robustness and accu-
racy. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the
36th International Conference on Machine Learning, volume 97 of Proceedings of
Machine Learning Research, pages 7472–7482. PMLR, 09–15 Jun 2019.
[343] Ruisi Zhang, Seira Hidano, and Farinaz Koushanfar. Text revealer: Private text
reconstruction via model inversion attacks against transformers. arXiv preprint
arXiv:2209.10505, 2022.
[344] Su-Fang Zhang, Jun-Hai Zhai, Bo-Jun Xie, Yan Zhan, and Xin Wang. Multi-
modal representation learning: Advances, trends and challenges. In 2019 Inter-
national Conference on Machine Learning and Cybernetics (ICMLC), pages 1–6.
IEEE, 2019.
[345] Susan Zhang, Mona Diab, and Luke Zettlemoyer. Democratizing access to large-
scale language models with OPT-175B. https://ai.facebook.com/blog/democratizi
ng-access-to-large-scale-language-models-with-opt-175b/, 2022. Meta AI.
[346] Wanrong Zhang, Shruti Tople, and Olga Ohrimenko. Leakage of dataset properties
in Multi-Party machine learning. In 30th USENIX Security Symposium (USENIX
Security 21), pages 2687–2704. USENIX Association, August 2021.
[347] Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. Adversarial
attacks on deep-learning models in natural language processing: A survey. ACM
Trans. Intell. Syst. Technol., 11(3), apr 2020.
[348] Yiming Zhang and Daphne Ippolito. Prompts should not be seen as se-
crets: Systematically measuring prompt extraction attack success. arXiv preprint
arXiv:2307.06865, 2023.
[349] Zhengming Zhang, Ashwinee Panda, Linyue Song, Yaoqing Yang, Michael Ma-
honey, Prateek Mittal, Ramchandran Kannan, and Joseph Gonzalez. Neurotoxin:
Durable backdoors in federated learning. In Kamalika Chaudhuri, Stefanie Jegelka,
Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the
39th International Conference on Machine Learning, volume 162 of Proceedings of
Machine Learning Research, pages 26429–26446. PMLR, 17–23 Jul 2022.
[350] Zhikun Zhang, Min Chen, Michael Backes, Yun Shen, and Yang Zhang. Infer-
ence attacks against graph neural networks. In 31st USENIX Security Symposium
(USENIX Security 22), 2022.
[351] Junhao Zhou, Yufei Chen, Chao Shen, and Yang Zhang. Property inference attacks
against GANs. In Proceedings of Network and Distributed System Security, NDSS,
2022.
[352] Chen Zhu, W. Ronny Huang, Hengduo Li, Gavin Taylor, Christoph Studer, and Tom
Goldstein. Transferable clean-label poisoning attacks on deep neural nets. In Ka-
malika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th Inter-
national Conference on Machine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 7614–7623. PMLR, 09–15 Jun 2019.
[353] Giulio Zizzo, Chris Hankin, Sergio Maffeis, and Kevin Jones. Adversarial machine
90
NIST AI 100-2e2023
January 2024
learning beyond the image domain. In Proceedings of the 56th Annual Design Au-
tomation Conference 2019, DAC ’19, New York, NY, USA, 2019. Association for
Computing Machinery.
[354] Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and
transferable adversarial attacks on aligned language models. arXiv preprint
arXiv:2307.15043, 2023.
91
NIST AI 100-2e2023
January 2024
Appendix: Glossary
Note: one may click on the page number shown at the end of the defnition of each
glossary entry to go to the page where the term is used.
A
adversarial examples (adversarial examples) Modifed testing samples which induce mis-
classifcation of a machine learning model at deployment time v, 9
adversarial success (adversarial success) Indicates reaching an availability breakdown,
integrity violations, privacy compromise, or abuse trigger (for GenAI models only)
in response to attempted adversarial attacks on the model 9
Area Under the Curve (Area Under the Curve) In ML the Area Under the Curve (AUC)
is a measure of the ability of a classifer to distinguish between classes. The higher
the AUC, the better the performance of the model at distinguishing between the two
classes. AUC measures the entire two-dimensional area underneath the R ECEIVER
O PERATING C HARACTERISTICS (ROC) curve 31
availability attack (availability attack) Adversarial attacks against machine learning which
degrade the overall model performance 9
B
backdoor pattern (backdoor pattern) A trigger pattern inserted into a data sample to induce
mis-classifcation of a poisoned model. For example, in computer vision it may be
constructed from a set of neighboring pixels, e.g., a white square, and added to
a specifc target label. To mount a backdoor attack, the adversary frst poisons
the data by adding the trigger to a subset of the clean data and changing their
corresponding labels to the target label 9
backdoor poisoning attacks (backdoor poisoning attacks) Poisoning attacks against ma-
chine learning which change the prediction on samples including a backdoor pat-
tern 9, 40
C
classifcation (classifcation) Type of supervised learning in which data labels are discrete
8
convolutional neural networks (convolutional neural networks) A Convolutional Neural
Network (CNN) is a class of artifcial neural networks whose architecture connects
neurons from one layer to the next layer and includes at least one layer performing
convolution operations. CNNs are typically applied to image analysis and classif-
cation. See [119] for further details 8, 32
D
data poisoning (data poisoning) Poisoning attacks in which a part of the training data is
under the control of the adversary 4, 8
92
NIST AI 100-2e2023
January 2024
data privacy (data privacy) Attacks against machine learning models to extract sensitive
information about training data 10
data reconstruction (data reconstruction) Data privacy attacks which reconstruct sensitive
information about training data records 10, 29
deployment stage (deployment stage) Stage of ML pipeline in which the model is deployed
on new data 8, 37
Diffusion Model (Diffusion Model) A class of latent variable generative models con-
sisting of three major components: a forward process, a reverse process, and a
sampling procedure. The goal of the diffusion model is to learn a diffusion pro-
cess that generates the probability distribution of a given dataset. It is widely used
in computer vision on a variety of tasks, including image denoising, inpainting,
super-resolution, and image generation 35
discriminative (discriminative) Type of machine learning methods which learn to discrim-
inate between classes 8
E
energy-latency attacks (energy-latency attacks) Attacks that exploit the performance de-
pendency on hardware and model optimizations to negate the effects of hardware
optimizations, increase computation latency, increase hardware temperature and
massively increase the amount of energy consumed 9, 10
ensemble learning (ensemble learning) Type of a meta machine learning approach that
combines the predictions of several models to improve the performance of the
combination 8
Expectation Over Transformation (Expectation Over Transformation) Expectation Over
Transformation (EOT) helps to strengthen adversarial examples to remain adver-
sarial under image transformations that occur in the real world, such as angle and
viewpoint changes. EOT models such perturbations within the optimization pro-
cedure. Rather than optimizing the log-likelihood of a single example, EOT uses
a chosen distribution of transformation functions taking an input controlled by the
adversary to the “true” input perceived by the classifer 18
extraction (extraction) The ability of an attacker to extract training data of a generative
model by prompting the model on specifc inputs 10
F
federated learning (federated learning) Type of collaborative machine learning, in which
multiple users train jointly a machine learning model 8
federated learning models (federated learning models) Federated learning is a method-
ology to train a decentralized machine learning model (e.g., deep neural networks
or a pre-trained large language model) across multiple end-devices without shar-
ing the data residing on each device. Thus, the end-devices collaboratively train
a global model by exchanging model updates with a server that aggregates the
updates. Compared to traditional centralized learning where the data are pooled,
93
NIST AI 100-2e2023
January 2024
federated learning has advantages in terms of data privacy and security but these
may come as tradeoffs to the capabilities of the models learned through federated
data. Other potential problems one needs to contend with here concern the trust-
worthiness of the end-devices and the impact of malicious actors on the learned
model 32
feed-forward neural networks (feed-forward neural networks) A Feed Forward Neural
Network is an artifcial neural network in which the connections between nodes is
from one layer to the next and do not form a cycle. See [119] for further details 32
fne-tuning (fne-tuning) Refers to the process of adapting a pre-trained model to perform
specifc tasks or to specialize in a particular domain. This phase follows the initial
pre-training phase and involves training the model further on task-specifc data.
This is often a supervised learning task 37
formal methods (formal methods) Formal methods are mathematically rigorous techniques
for the specifcation, development, and verifcation of software systems 20
Functional Attacks (Functional Attacks) Adversarial attacks that are optimized for a set
of data in a domain rather than per data point 16, 25
G
generative (generative) Type of machine learning methods which learn the data distribution
and can generate new examples from distribution 8
generative adversarial networks (generative adversarial networks) A generative adver-
sarial network (GAN) is a class of machine learning frameworks in which two
neural networks contest with each other in the form of a zero-sum game, where
one agent’s gain is another agent’s loss. GAN’s learn to generate new data with the
same statistics as the training set. See [119] for further details 32, 35
Generative Pre-Trained Transformer (Generative Pre-Trained Transformer) An artifcial
neural network based on the transformer architecture [304], pre-trained on large
data sets of unlabelled text, and able to generate novel human-like content. Today,
this is the predominant architecture for natural language processing tasks 35
graph neural networks (graph neural networks) A Graph Neural Network (GNN) is an
optimizable transformation on all attributes of the graph (nodes, edges, global-
context) that preserves the graph symmetries (permutation invariances). GNNs
utilize a “graph-in, graph-out” architecture that takes an input graph with informa-
tion loaded into its nodes, edges and global-context, and progressively transform
these embeddings into an output graph with the same connectivity as that of the
input graph 32
H
hidden Markov models (hidden Markov models) A hidden Markov model (HMM) is a
statistical Markov model in which the system being modeled is assumed to be a
Markov process with unobservable states. In addition, the model provides an ob-
servable process whose outcomes are ”infuenced” by the outcomes of Markov
94
NIST AI 100-2e2023
January 2024
model in a known way. HMM can be used to describe the evolution of observable
events that depend on internal factors, which are not directly observable. In ma-
chine learning it is assumed that the internal state of a model is hidden but not the
hyperparameters 32
indirect prompt injection (indirect prompt injection) Attacker technique in which a hacker
relies on an LLM ingesting a prompt injection attack indirectly, e.g., by visiting a
web page or document. Unlike its direct prompt injection sibling, the attacker in
this scenario does not directly supply a prompt, but attempts to inject instructions
indirectly by having the text ingested by some other mechanism, e.g., a plugin 38,
39, 44
integrity attack (integrity attack) Adversarial attacks against machine learning which change
the output prediction of the machine learning model 9
label fipping (label fipping) a type of data poisoning attack where the adversary is re-
stricted to changing the training labels 22
label limit (label limit) Capability in which the attacker in some scenarios does not control
the labels of training samples in supervised learning 10
logistic regression (logistic regression) Type of linear classifer that predicts the probability
of an observation to be part of a class 8
machine unlearning (machine unlearning) Technique that enables a user to request re-
moval of their records from a trained ML model. Effcient approximate unlearning
techniques do not require retraining the ML model from scratch 34
membership-inference attacks (membership-inference attacks) Data privacy attacks to
determine if a data sample was part of the training set of a machine learning model
10, 29
model control (model control) Capability in which the attacker has control over machine
learning model parameters 10
model extraction (model extraction) Type of privacy attack to extract model architecture
and parameters 10
model poisoning (model poisoning) Poisoning attacks in which the model parameters are
under the control of the adversary 8, 9, 40
95
NIST AI 100-2e2023
January 2024
model privacy (model privacy) Attacks against machine learning models to extract sensi-
tive information about the model 10
multimodal models (multimodal models) Modality is associated with the sensory modal-
ities which represent primary human channels of communication and sensation,
such as vision or touch. Multimodal models process and relate information from
multiple modalities 55
out-of-distribution (out-of-distribution) This term refers to data that was collected at a dif-
ferent time, and possibly under different conditions or in a different environment,
than the data collected to train the model 51
query access (query access) Capability in which the attacker can issue queries to a trained
machine learning model and obtain predictions 10, 39
96
NIST AI 100-2e2023
January 2024
Red Teaming (Red Teaming) NIST defnes cybersecurity red-teaming as “A group of peo-
ple authorized and organized to emulate a potential adversary’s attack or exploita-
tion capabilities against an enterprise’s security posture. The Red Team’s objective
is to improve enterprise cybersecurity by demonstrating the impacts of successful
attacks and by demonstrating what works for the defenders (i.e., the Blue Team)
in an operational environment.” (CNSS 2015 [80]) Traditional red-teaming might
combine physical and cyber attack elements, attack multiple systems, and aims to
evaluate the overall security posture of an organization. Penetration testing (pen
testing), in contrast, tests the security of a specifc application or system. In AI
discourse, red-teaming has come to mean something closer to pen testing, where
the model may be rapidly or continuously tested by a set of evaluators and under
conditions other than normal operation 51
regression (regression) Type of supervised ML model that is trained on data including
numerical labels (called response variables). Types of regression algorithms in-
clude linear regression, polynomial regression, and various non-linear regression
methods 8
reinforcement learning (reinforcement learning) Type of machine learning in which an
agent interacts with the environment and learns to take actions which optimize a
reward function 8
resource control (resource control) Capability in which the attacker has control over the
resources consumed by a ML model, particularly for LLMs and RAG applications
39, 44
Retrieval Augmented Generation (Retrieval Augmented Generation) This term refers
to retrieving data from outside a foundation model and augmenting prompts by
adding the relevant retrieved data in context. RAG allows fne-tuning and modif-
cation of the internal knowledge of the model in an effcient manner and without
needing to retrain the entire model. First, the documents and user prompts are
converted into a compatible format to perform relevancy search. Typically this is
accomplished by converting the document collection and user prompts into numer-
ical representations using embedding language models. RAG model architectures
compare the embeddings of user prompts within the vector of the knowledge li-
brary. The original user prompt is then appended with relevant context from sim-
ilar documents within the knowledge library. This augmented prompt is then sent
to the foundation model. For RAG to work well, the augmented prompt must ft
into the context window of the model 3, 36, 37, 39, 44
rowhammer attacks (rowhammer attacks) Rowhammer is a software-based fault-injection
attack that exploits DRAM disturbance errors via user-space applications and al-
lows the attacker to infer information about certain victim secrets stored in memory
cells. Mounting this attack requires attacker’s control of a user-space unprivileged
process that runs on the same machine as the victim’s ML model 32
97
NIST AI 100-2e2023
January 2024
targeted poisoning attacks (targeted poisoning attacks) Poisoning attacks against machine
learning which change the prediction on a small number of targeted samples 9, 40
testing data control (testing data control) Capability in which the attacker has control over
the testing data input to the machine learning model 10
training data control (training data control) Capability in which the attacker has control
over a part of the training data of a machine learning model 10, 39
training stage (training stage) Stage of machine learning pipeline in which the model is
trained using training data 8, 37
trojans (trojans) A malicious code/logic inserted into the code of a software or hard-
ware system, typically without the knowledge and consent of the organization that
owns/develops the system, that is diffcult to detect and may appear harmless, but
can alter the intended function of the system upon a signal from an attacker to cause
a malicious behavior desired by the attacker. For Trojan attacks to be effective, the
trigger must be rare in the normal operating environment so that it does not affect
the normal effectiveness of the AI and raise the suspicions of human users 4, 54
98
NIST AI 100-2e2023
January 2024
99