
RETHINKING MACHINE UNLEARNING FOR LARGE LANGUAGE MODELS

Sijia Liu1,2, Yuanshun Yao3, Jinghan Jia1, Stephen Casper4, Nathalie Baracaldo2, Peter Hase5, Xiaojun Xu3, Yuguang Yao1, Hang Li3, Kush R. Varshney2, Mohit Bansal5, Sanmi Koyejo6, Yang Liu3,7

arXiv:2402.08787v1 [cs.LG] 13 Feb 2024

1 Computer Science & Engineering Dept., Michigan State University, USA
2 IBM Research, USA
3 ByteDance Research, USA
4 Computer Science and Artificial Intelligence Laboratory, MIT, USA
5 Computer Science Dept., University of North Carolina at Chapel Hill, USA
6 Computer Science Dept., Stanford University, USA
7 Computer Science & Engineering Dept., University of California, Santa Cruz, USA

ABSTRACT
We explore machine unlearning (MU) in the domain of large language models (LLMs), referred
to as LLM unlearning. This initiative aims to eliminate undesirable data influence (e.g., sensitive
or illegal information) and the associated model capabilities, while maintaining the integrity of
essential knowledge generation and not affecting causally unrelated information. We envision LLM
unlearning becoming a pivotal element in the life-cycle management of LLMs, potentially standing
as an essential foundation for developing generative AI that is not only safe, secure, and trustworthy,
but also resource-efficient without the need for full retraining. We navigate the unlearning landscape in LLMs across conceptual formulation, methodologies, metrics, and applications. In particular, we
highlight the often-overlooked aspects of existing LLM unlearning research, e.g., unlearning scope,
data-model interaction, and multifaceted efficacy assessment. We also draw connections between
LLM unlearning and related areas such as model editing, influence functions, model explanation,
adversarial training, and reinforcement learning. Furthermore, we outline an effective assessment
framework for LLM unlearning and explore its applications in copyright and privacy safeguards and
sociotechnical harm reduction.

1 Introduction

Large language models (LLMs) have shown exceptional proficiency in generating text that closely resembles human-
authored content. However, their ability to memorize extensive corpora may also lead to ethical and security concerns.
These include societal biases and stereotyping (Bender et al., 2021; Motoki et al., 2023; Kotek et al., 2023), the
generation of sensitive, private, harmful, or illegal content (Nasr et al., 2023; Wen et al., 2023; Karamolegkou et al.,
2023; Patil et al., 2024), ease of jailbreaking (Wei et al., 2023; Zou et al., 2023; Liu et al., 2023a), and possible
malicious use in developing cyberattacks or bioweapons (Barrett et al., 2023; Hendrycks et al., 2023). These concerns
emphasize the need to adeptly and efficiently tailor pre-trained LLMs to suit diverse safety contexts while meeting
specific requirements of users and sectors.
With the costly and prolonged training periods of LLMs, retraining these models to eliminate undesirable data effects is
often impractical. Machine unlearning (MU) has emerged as an alternative to remove the influence of undesirable data
and associated model capabilities from the pre-trained models (Cao & Yang, 2015; Bourtoule et al., 2021; Nguyen et al.,
2022; Si et al., 2023; Zhang et al., 2023a; Eldan & Russinovich, 2023; Yao et al., 2023). For example, MU is used
as a strategy to prevent the generation of copyrighted material from the Harry Potter series in (Eldan & Russinovich,
2023). In the context of classification tasks, MU has been extensively studied by Ginart et al. (2019); Neel et al.

Correspondence to: Sijia Liu ([email protected]), Yang Liu ([email protected]).



[Figure 1: panels depict why (safety alignment, privacy compliance, copyright removal, bias mitigation, hallucination removal), where (pre-training, pre-alignment, in-alignment, post-alignment), how (gradient ascent, fine-tuning, localization-informed, influence function), and evaluation (red teaming, user feedback) within the LLM pipeline from data and pretrained LLM through alignment to aligned LLM.]
Figure 1: Demonstration of how MU can be incorporated into the LLM development cycle. The landscape of LLM unlearning will be mainly navigated from applications (‘why’), methods (‘where’ and ‘how’), and evaluations.

(2021); Ullah et al. (2021); Sekhari et al. (2021); Golatkar et al. (2020); Jia et al. (2023). Yet, its application to and understanding in LLMs remain limited, as these models are typically used for generative tasks such as summarization, sentence completion, paraphrasing, and question answering. Therefore, this paper specifically concentrates on exploring
the MU problems in LLMs, termed ‘LLM Unlearning’.
As data-model scales continue to grow, the emergence of LLM unlearning introduces new challenges and complexities,
as will be elaborated on in Sec. 2. For example, current research efforts (Lu et al., 2022; Jang et al., 2022; Ilharco et al.,
2022; Eldan & Russinovich, 2023; Wu et al., 2023b; Yu et al., 2023; Zhang et al., 2023c; Yao et al., 2023) suffer from a
lack of standardized corpora and evaluation methods for LLM unlearning, often facing issues related to unlearning
precision and efficiency. We also refer readers to Sec. 3 for a summary of existing LLM unlearning tasks.
Although preliminary surveys of LLM unlearning have been provided by Si et al. (2023) and Zhang et al. (2023a),
this paper is, to the best of our knowledge, the first to offer a thorough and in-depth analysis of LLM unlearning. Our
contributions are summarized below. Fig. 1 provides an overview of the LLM unlearning landscape that we explore.
(1) Surveying: We conduct an in-depth review of the foundational concepts and principles of LLM unlearning, delving
into the problem formulation, categories of unlearning methods, evaluation approaches, and practical applications.
(2) Uncovering: We bring to light previously overlooked dimensions of LLM unlearning, e.g., emphasizing the
significance of precisely defining the unlearning scope, elucidating the interplay between data and model interactions,
and exploring the adversarial assessment of unlearning efficacy.
(3) Connecting: We establish connections between LLM unlearning and other relevant problems and domains, providing
a comparative analysis with related topics such as model editing, influence function, and adversarial learning.
(4) Forecasting: We offer insights into the future of LLM unlearning by identifying novel opportunities & challenges.
This paper is positioned to reassess the challenge of LLM unlearning, refining its scope across various dimensions:
conceptual formulation (Sec. 3), methods (Sec. 4), assessment (Sec. 5), and applications (Sec. 6); see the schematic
overview in Fig. 1. We delve into each dimension through surveying, uncovering, connecting, and forecasting. We
conclude that unlearning will be a valuable tool for making LLMs more trustworthy, but making more progress on this
will require updating the unlearning paradigm. We aspire for this work to pave the way for developing LLM unlearning,
illuminating its opportunities, challenges, and untapped potential.

2 Preliminaries and Related Work

LLM unlearning has garnered attention for addressing trustworthiness concerns such as toxicity (Lu et al., 2022),
copyright and privacy (Jang et al., 2022; Eldan & Russinovich, 2023; Wu et al., 2023b), fairness (Yu et al., 2023), model
editing (Ilharco et al., 2022; Zhang et al., 2023c), hallucination (Yao et al., 2023), and sensitive knowledge (Barrett
et al., 2023; Hendrycks et al., 2023). In this section, we present a succinct overview of MU, tracing its journey from
traditional ML models to the emerging challenges in LLMs.


MU for non-LLMs. The study of MU can be traced back to non-LLMs in response to data protection regulations
such as ‘the right to be forgotten’ (Cao & Yang, 2015; Hoofnagle et al., 2019; Bourtoule et al., 2021; Nguyen et al.,
2022). Due to its capability of assessing data influence on model performance, the landscape of MU has expanded to
encompass diverse domains, such as image classification (Ginart et al., 2019; Golatkar et al., 2020; Neel et al., 2021;
Ullah et al., 2021; Sekhari et al., 2021), text-to-image generation (Gandikota et al., 2023; Zhang et al., 2023b; Kumari
et al., 2023; Fan et al., 2024), federated learning (Liu et al., 2020; Wang et al., 2022; Che et al., 2023; Liu et al., 2023b;
Halimi et al., 2022), and graph neural networks (Chen et al., 2022; Chien et al., 2022; Wu et al., 2023a).
In the literature, ‘exact’ unlearning, which involves retraining the model from scratch after removing specific training
data points, is often considered the gold standard. However, this approach comes with significant computational
demands and requires access to the entire training set (Thudi et al., 2022). To address these challenges, many research
efforts have shifted towards the development of scalable and effective approximate unlearning methods (Golatkar et al.,
2020; Warnecke et al., 2021; Becker & Liebig, 2022; Thudi et al., 2022; Jia et al., 2023; Chen et al., 2023a). In addition,
probabilistic methods with certain provable removal guarantees have been explored, often leveraging the concept of
differential privacy (Ginart et al., 2019; Guo et al., 2019; Neel et al., 2021; Ullah et al., 2021; Sekhari et al., 2021).

Challenges of MU for LLMs. LLM unlearning introduces new challenges and complexities.
First, LLMs are trained on massive amounts of data, which can unintentionally introduce biases and the memorization
of personal and confidential information. Accordingly, it becomes challenging to precisely define and localize the
‘unlearning targets’, such as the subset of the training set or a knowledge concept that needs to be removed. Therefore,
current studies on LLM unlearning (Lu et al., 2022; Jang et al., 2022; Ilharco et al., 2022; Eldan & Russinovich, 2023;
Wu et al., 2023b; Yu et al., 2023; Zhang et al., 2023c; Yao et al., 2023) are typically context and task-dependent. There
is a lack of standardized corpora for LLM unlearning.
Second, the growing size of LLMs and the rise of black-box access to LLM-as-a-service present challenges in developing MU techniques that are scalable and adaptable to LLMs (Bucknall & Trager, 2023; Casper et al., 2024). This also affects
performance evaluation, given the absence of retraining as a benchmark. To address these challenges, previous studies
have proposed approaches like in-context unlearning (Pawelczyk et al., 2023) and fictitious unlearning (Maini et al.,
2024), where the former enables unlearning on black-box models, and the latter provides alternatives to retraining.
Third, the scope of unlearning is often underspecified for LLMs. This issue is similar to challenges faced in model
editing (Mitchell et al., 2022). For instance, effective unlearning should ensure that LLMs delete knowledge of the targeted data within the predefined scope while simultaneously maintaining their utility for data outside of this scope.
Fourth, despite the potential of LLM unlearning in diverse applications, there is a notable absence of comprehensive
and reliable evaluation. For example, recent studies (Shi et al., 2023; Patil et al., 2024) have demonstrated that sensitive
information can be reverse-engineered from an edited model, even if efforts were made to delete this information. This
highlights the need for thorough and adversarial evaluations and the design of more mechanistic methods to guarantee
the authenticity of unlearning.

3 Unpacking LLM Unlearning


In this section, we formalize the LLM unlearning problem and delve into its configuration. We will outline the inherent
facts and complexities in its formulation.

Problem statement. In light of the literature on unlearning (Bourtoule et al., 2021; Jia et al., 2023; Kurmanji et al.,
2023), and its progression in LLMs (Pawelczyk et al., 2023; Yao et al., 2023; Ishibashi & Shimodaira, 2023; Maini
et al., 2024), we define the problem of LLM unlearning below.
(LLM unlearning) How can we efficiently and effectively eliminate the influence of specific ‘unlearning targets’
and remove associated model capabilities while preserving model performance for non-targets?

We dissect the above statement from four perspectives: (1) unlearning targets, (2) influence erasure, (3) unlearning
effectiveness, and (4) efficiency. See Table 1 for a summary of existing LLM unlearning studies based on (1)-(4).
• Unlearning targets: Unlearning tasks may take on various forms and are closely related to the unlearning objectives.
For instance, one could focus on data influence removal, while another could emphasize model capability removal.
Although these two aspects are intertwined, the former is often crucial for intellectual property (IP) protection, while the
latter is more practical for AI alignment and safety. The literature identifies unlearning targets as specific data points,
which could involve content containing harmful, unethical, or illegal language (Jang et al., 2022; Wu et al., 2023b).
They have also been represented by higher-level unlearned knowledge, expressed through an unwanted text prompt or


Table 1: A summary of existing LLM unlearning problems through unlearning targets, influence erasure, effectiveness, and efficiency. An asterisk (*) indicates the incapability of evaluating unlearning against retraining for LLMs due to the impracticality of retraining these models. Effectiveness covers (I) in-scope evaluation for unlearning efficacy and (O) out-of-scope evaluation for model utility.

Related work | Unlearning targets/tasks | Influence erasure methods | Effectiveness | Efficiency
(Lu et al., 2022) | Reducing toxic content, avoiding undesirable sentiments, and preventing repeated text generation | Reward-reinforced model fine-tuning | (I) Toxic prompts, specific sentiments & repetitive sentences; (O) Unlearning target-irrelevant prompts | N/A
(Jang et al., 2022) | Degenerating private information, w/ unlearning response irrelevant to this info | Gradient ascent-based fine-tuning | (I) Prompts from training data extraction; (O) Natural language understanding tasks | Runtime cost
(Kumar et al., 2022) | Text de-classification, w/ unlearning response close to that of retraining* | Sharded, isolated, sliced, and aggregated (SISA) training via adapter | (I) No evaluation for unlearning efficacy; (O) Test set | Runtime cost; memory cost
(Ilharco et al., 2022; Zhang et al., 2023c) | Degenerating toxic content | Task vector-based parameter-efficient fine-tuning via LoRA | (I) Prompts leading to toxic generation; (O) Perplexity on other datasets | N/A
(Wang et al., 2023) | Text de-classification/de-generation and unlearning specific words in translation, w/ response close to that of retraining* | KL-divergence-based fine-tuning | (I) Training subset; (O) Test set | Runtime cost
(Yu et al., 2023) | Unlearning gender and profession bias, with de-biased unlearning response | Weight importance-informed & relabeling-based fine-tuning | (I) Biased prompts; (O) No evaluation for model utility | N/A
(Pawelczyk et al., 2023) | Text de-classification, w/ unlearning response close to that of retraining* | In-context learning | (I) Training subset; (O) Retain & test sets | Black-box access
(Eldan & Russinovich, 2023) | Degenerating Harry Potter-related book content, w/ unlearning response irrelevant to Harry Potter | Relabeling-based fine-tuning | (I) Questions and their rephrased/hard versions about Harry Potter; (O) NLU tasks | N/A
(Ishibashi & Shimodaira, 2023) | Unlearning knowledge from a QA dataset, with refusal response (e.g., ‘I don’t know’) | Relabeling-based fine-tuning | (I) Adversarial and original questions about forgotten knowledge; (O) Other QA prompts | N/A
(Chen & Yang, 2023) | Text de-classification and de-generation, with response close to that of retraining* | KL-divergence-based parameter-efficient fine-tuning via adapter | (I) Training subset; (O) Retain & test sets | Runtime cost
(Wu et al., 2023b) | Degenerating private information, w/ unlearning response irrelevant to this info | Importance-based neuron editing | (I) Memorized private data points; (O) Test set | Runtime cost
(Yao et al., 2023) | Degenerating harmful prompts, degenerating Harry Potter-related book content, and reducing hallucination | Integration of gradient ascent, random labeling, & KL-divergence-based fine-tuning | (I) Prompts related to unlearning targets; (O) NLU tasks | Runtime cost
(Maini et al., 2024) | Unlearning biographical knowledge about fictitious authors | Fine-tuning with various objectives | (I) Q&A about the unlearned authors; (O) Q&A about other authors and general facts | Runtime cost
(Patil et al., 2024) | Degenerating sensitive information, using factual information as a testbed | Model editing techniques and constrained fine-tuning | (I) Prompts for unlearned factual knowledge; (O) Prompts for unrelated factual knowledge | White-box vs. black-box access

concept (Lu et al., 2022; Yao et al., 2023; Eldan & Russinovich, 2023). For example, Eldan & Russinovich (2023)
defined the unlearning target as ‘Harry Potter’-related content, with the objective to avoid generating such content
irrespective of where the content was learned: from the copyrighted material, blog posts, or news articles.
• Influence erasure: Erasing the influence of unlearning targets and associated model capabilities requires a joint
examination of both data and model influences rather than studies in isolation. Specifically, it is important to scrutinize
the contributions of data sources to undesired model outputs, as well as the roles played by individual components
within a model in generating these undesirable outcomes. This dual examination allows us to gain a more comprehensive
understanding of the mechanisms driving these outputs, thereby facilitating the development of unlearning strategies
to prevent them effectively. The objective of achieving complete influence erasure also implies the importance
of robustness and generalization in unlearned behavior. When evaluating LLM unlearning, especially when using
approximate methods shown in Table 1, a rigorous criterion is needed. This viewpoint is underscored by Patil et al. (2024), who demonstrate that sensitive information can persist in model weights after editing or unlearning, and by Maini et al. (2024) and Shi et al. (2023), who expose the limitations of conventional unlearning methods.
• Unlearning effectiveness: The effectiveness of LLM unlearning extends beyond merely diminishing the specific data
influence or model capability. A crucial aspect of effectiveness is the concept of unlearning scope, as inspired by the
editing scope (Mitchell et al., 2022). This has implications for the development of metrics to evaluate the effectiveness
of unlearning and the design of effective unlearning methods. The unlearning scope defines the success of influence
erasure by its capacity to accurately modify the model’s behavior for in-scope examples, which are either directly or
indirectly related to unlearning targets, such as paraphrased text prompts to be degenerated (Zhong et al., 2023; Patil
et al., 2024). At the same time, it must also ensure the generation consistency for out-of-scope examples, necessitating
the preservation of model generation capabilities post-unlearning. Differentiating between in-scope and out-of-scope
examples for unlearning is its own unsolved problem, as it requires determining when facts logically imply one another
(Hase et al., 2023b; Cohen et al., 2023).
• Unlearning efficiency & feasibility: Current research efforts have focused on the development of rapid unlearning methods
for pre-trained LLMs due to the significant re-training cost (Jang et al., 2022; Eldan & Russinovich, 2023; Yao et al.,
2023). However, beyond computational efficiency, LLMs present additional efficiency challenges. These include the
complexity and, at times, the infeasibility of pinpointing and attributing training data points designated for unlearning.
Additionally, there is the challenge of executing unlearning in the context of black-box LLMs (Pawelczyk et al., 2023),
where interactions with models are constrained to input-output queries.

Distinguishing LLM unlearning from conventional MU. According to the above dimensions, LLM unlearning
involves a broader range of targets, which are often context-dependent and less clearly defined. Moreover, the
effectiveness of LLM unlearning is not limited to forgetting the influence of specific data points but also includes
defining a broader unlearning scope for model capability removal. Furthermore, there is a critical need to devise
more mechanistic methods that guarantee effective and robust unlearning, while also enhancing their practicality and
feasibility.

Mathematical modeling. Building upon the high-level formulation of LLM unlearning problems presented earlier,
we next provide mathematical modeling details and discuss the associated design choices. To facilitate comprehension,
we provide a commonly-used formulation of LLM unlearning problems below. While this may not be the sole or
optimal approach to LLM unlearning, it incorporates several key elements that are essential to the problem setup.
$$\min_{\theta}\;\; \underbrace{\mathbb{E}_{(x,\, y_f) \in \mathcal{D}_f}\big[\ell(y_f \mid x;\, \theta)\big]}_{\text{Forget}} \;+\; \lambda\, \underbrace{\mathbb{E}_{(x,\, y) \in \mathcal{D}_r}\big[\ell(y \mid x;\, \theta)\big]}_{\text{Retain}} \tag{1}$$

where ℓ(y|x; θ) denotes the prediction loss of the model θ for the response y given the input x; Df and Dr refer to the ‘forget’ and ‘retain’ sets, explained below; yf denotes the desired model response post-unlearning; and λ ≥ 0 is a regularization parameter that balances ‘forget’ and ‘retain’ (e.g., λ = 0 if the retain set is not given a priori).
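To make the roles of the two terms concrete, below is a minimal PyTorch-style sketch of evaluating objective (1), assuming a Hugging Face-style causal language model whose forward pass returns a token-level cross-entropy loss when labels are supplied; the function name, batch format, and choice of λ are illustrative rather than prescribed by any particular method.

```python
def unlearning_objective(model, forget_batch, retain_batch, lam=1.0):
    """Evaluate the forget + retain objective in Eq. (1) for one batch pair.

    Assumes each batch is a dict of token IDs: 'input_ids' (the prompt x plus
    response) and 'labels' (y_f for the forget batch, i.e., the desired
    post-unlearning response; y for the retain batch).
    """
    # Forget term: loss on the desired post-unlearning responses y_f.
    forget_loss = model(input_ids=forget_batch["input_ids"],
                        labels=forget_batch["labels"]).loss
    # Retain term: standard prediction loss on out-of-scope data D_r.
    retain_loss = model(input_ids=retain_batch["input_ids"],
                        labels=retain_batch["labels"]).loss
    return forget_loss + lam * retain_loss
```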
In the dataset setup of LLM unlearning, we typically assume access to a forget set (Df ), the influence of which should
be eliminated in LLM generation. For instance, Df might consist of a collection of harmful or toxic prompt-response
pairs designated for degeneration (Yao et al., 2023). Moreover, if the original training set is available, then Df can
be composed of a subset of training data points related to the unlearning target. Alternatively, it can be generated
using synthesized data points based on a higher-level unlearned knowledge concept, or it can be derived from a set of
extracted training data points reverse-engineered from the given LLM itself. In practice, the forget set Df is not required
to belong precisely to the LLM’s training corpus. Given the diversity and extensive size of the model’s training data,
the content we aim to unlearn is more likely to represent a general concept. Thus, LLM unlearning needs to not only
unlearn specific training samples but also generalize to similar samples that share common characteristics.
Besides the forget set Df , sometimes there is a need for a retain set (Dr ), which contains samples that are not subject
to unlearning and used to preserve the utility of the unlearned model. Through the lens of the unlearning scope we
discussed earlier, the forget set (Df ) provides in-scope examples earmarked for unlearning, while the retain set (Dr )
involves examples out of the unlearning scope. Some recent studies have also attempted to develop LLM unlearning
approaches that operate independently of access to forget and/or retain sets (Pawelczyk et al., 2023; Li et al., 2023b).
We next introduce the model and optimization setups for LLM unlearning. Unlearning is performed in the post-training phase. As shown in (1), a common unlearning objective is to efficiently update the original pre-trained model so that the updated model can unlearn on Df while retaining its generation capability on Dr. Another design element is the unlearning response (yf), defined as the response of an unlearned model to in-scope examples. For
example, in the stateful LLM unlearning method aimed at erasing information related to ‘Who’s Harry Potter?’ (Eldan &
Russinovich, 2023), the unlearning response is based on word replacements using generic translations, like substituting
‘Quidditch’ with ‘Skyball’, as part of the unlearning process. However, this type of approach may inadvertently blur
the line between LLM hallucination and legitimate responses, highlighting the need for improvements in unlearning
response design. Another choice is an empty response (Wu et al., 2023b; Patil et al., 2024), given by the rejection ‘I don’t know’ (Patil et al., 2024) or a customized response that ‘masks’ the unlearned information using specialized tokens such as ‘■’ (Wu et al., 2023b). However, we need to ensure that the empty response targets only examples within the unlearning scope. Otherwise, frequent rejections may occur, potentially diminishing the user experience with LLMs.

4 LLM Unlearning Methods Revisited: Prior Art and Overlooked Principles


In this section, we begin with a review of existing MU (machine unlearning) methodologies adapted for LLM unlearning.
Following that, we delve into the design factors that are often overlooked in the literature, including the intertwined
data-model influence, the relationship with model editing, and adversarial training.

Review of existing unlearning principles. Existing LLM unlearning methods can be broadly categorized into
two groups: model-based and input-based. Model-based methods involve modifying the weights and/or architecture
components of LLMs to achieve the unlearning objective (Jang et al., 2022; Lu et al., 2022; Yao et al., 2023; Yu
et al., 2023; Chen & Yang, 2023; Zhang et al., 2023c; Hase et al., 2023a; Wu et al., 2023b; Rafailov et al., 2023), e.g.,
following the mathematical formulation in Sec. 3. Input-based methods design input instructions (Madaan et al., 2022;
Zheng et al., 2023; Pawelczyk et al., 2023), such as in-context examples or prompts, to guide the original LLM (without
parameter updating) towards the unlearning objective.
In the literature, the predominant research emphasis lies on model-based methods, as shown in Table 1. We review some
of the most representative approaches below.
• Gradient ascent and its variants (Jang et al., 2022; Yao et al., 2023; Yu et al., 2023): Gradient ascent (GA) stands as
one of the most straightforward unlearning methods, updating the model parameters by maximizing the likelihood of
mis-prediction for the samples within the forget set Df (Jang et al., 2022). However, it is worth noting that GA alone can
be sensitive to the choice of hyperparameters during optimization, such as the number of ascent steps and the learning
rate (Jia et al., 2023; Fan et al., 2024). This has given rise to improved variants of GA. For example, Yao et al. (2023)
incorporate random labeling to augment the unlearning objective and ensure utility preservation on the retain set Dr .
Another variant of GA transforms it into a gradient descent approach that minimizes the prediction loss on relabeled forgetting data (Yu et al., 2023). This gradient descent-based fine-tuning, over relabeled or randomly labeled forgetting data, is also employed in (Eldan & Russinovich, 2023), where generic translations are used to replace the unlearned texts. GA and its variants often involve fine-tuning pre-trained LLMs for unlearning purposes; a minimal sketch combining gradient ascent with a retain-set regularizer appears after this list. To enhance efficiency, parameter-efficient fine-tuning (PEFT) techniques could be employed. Chen & Yang (2023) fine-tune an adapter over the unlearning objective that acts as an unlearning layer within the LLM. Zhang et al. (2023c) use LoRA to create task vectors and accomplish unlearning by negating tasks under these task vectors.
• Localization-informed unlearning (Meng et al., 2022; Yu et al., 2023; Wu et al., 2023b): The pursuit of parameter
efficiency is also in line with the objective of identifying and localizing a subset of model units (e.g., layers, weights, or
neurons) that are essential for the unlearning task. For example, the process of localization can be accomplished through
representation denoising, also known as causal tracing, in (Meng et al., 2022), focusing on the unit of model layers.
Previous work shows that it is important to delete information about unlearning targets wherever it is represented in
models in order to protect against attacks (Patil et al., 2024). In addition, gradient-based saliency (Yu et al., 2023) is
employed to identify the crucial weights that need to be fine-tuned to achieve the unlearning objective. In (Wu et al.,
2023b), neurons that respond to unlearning targets are identified within the feed-forward network and subsequently
selected for knowledge unlearning. In the context of vision models, unlearning can also benefit from localizing
weights salient to unlearning, as demonstrated by Jia et al. (2023) and Fan et al. (2024). Furthermore, the concept of
localization-informed unlearning resonates with the development of future modular machine learning solutions (Menik & Ramaswamy, 2023). This modularity allows emerging foundation models to be partitioned into manageable
subparts, facilitating easier maintenance and independent updates for each component.
• Influence function-based methods: While the influence function (Koh & Liang, 2017; Bae et al., 2022) is a standard
approach to assess the effect of data removal on model performance (Izzo et al., 2021; Warnecke et al., 2021), it is not
commonly employed in the context of LLM unlearning for two main reasons: the computational complexity involved in
inverting the Hessian matrix, and the reduced accuracy resulting from the use of approximations in influence function
derivation (Jia et al., 2023). However, the potential of influence functions in LLM unlearning may be underestimated, given that the scalability issue has been alleviated by Grosse et al. (2023), and approximation errors can be mitigated by
focusing on localized weights that are salient to unlearning, as described in the previous category.
• Other model-based methods: Other types of LLM unlearning methods fall outside the categories mentioned above.
For example, Jang et al. (2022); Chen & Yang (2023) show that sequential unlearning outperforms batch unlearning. However, as indicated by Gu et al. (2024), sequential editing of LLMs may compromise their general capabilities.
Therefore, the study of sequential unlearning in LLMs continues to be an unresolved issue.
• Input-based vs. model-based: Input-based strategies (Madaan et al., 2022; Zheng et al., 2023; Pawelczyk et al., 2023)
have also shown promise in addressing the challenges posed by the restricted access to black-box LLMs and achieving
parameter efficiency of LLM unlearning. Here the learnable parameters are given by input prompts rather than model
weights/architecture components. However, we posit that input-based methods may not necessarily yield genuinely
unlearned models, leading to weaker unlearning strategies compared to model-based methods because modifying
the inputs of LLMs alone may not be sufficient to completely erase the influence of unlearning targets (Toyer et al.,
2023). This assertion is also supported by the existence of hard or even adversarial in-scope examples associated with unlearning targets and the challenge of removing their influence in LLMs (Zhong et al., 2023; Patil et al., 2024).
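To illustrate the model-based family above, here is a minimal sketch of one unlearning update that combines gradient ascent on the forget set with KL-divergence regularization toward a frozen reference model on the retain set, in the spirit of the GA variants surveyed above. It assumes Hugging Face-style causal LMs (a forward pass returning .loss/.logits); the helper name, batch format, and weighting beta are illustrative and not the exact recipe of any cited paper.

```python
import torch
import torch.nn.functional as F

def ga_unlearning_step(model, ref_model, forget_batch, retain_batch,
                       optimizer, beta=1.0):
    """One gradient-ascent-style unlearning step with a KL retain term."""
    model.train()
    optimizer.zero_grad()

    # Ascent on the forget set: maximize the likelihood loss on forgotten text
    # by minimizing its negative.
    forget_loss = model(input_ids=forget_batch["input_ids"],
                        labels=forget_batch["input_ids"]).loss

    # Retain term: keep the predictive distribution close to the frozen
    # reference model on out-of-scope (retain) data.
    logits = model(input_ids=retain_batch["input_ids"]).logits
    with torch.no_grad():
        ref_logits = ref_model(input_ids=retain_batch["input_ids"]).logits
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")

    loss = -forget_loss + beta * kl
    loss.backward()
    optimizer.step()
```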

Exploring data-model interactions. A key objective of unlearning is to eliminate the influence of the forgotten
data points/knowledge on the model’s performance. However, this process is not studied in isolation: It is closely
connected to exploring the influence of model weights or architecture components. Unlearning requires a sense of
locality, which involves addressing the specific unlearning target and its associated unlearning scope. Consequently,
exploring model influence helps identify the specific, localized areas of the model that are relevant to this locality. This
is further reinforced by the surveyed weight localization techniques (Meng et al., 2022; Yu et al., 2023; Wu et al., 2023b;
Patil et al., 2024). Thus, model influence and data influence are intertwined in LLM unlearning, and a comprehensive
understanding of the former can streamline the process of handling data influence.

Relationship with model editing. Model editing, closely related to LLM unlearning, focuses on the local alteration
of pre-trained models’ behavior to introduce new knowledge or rectify undesirable behaviors. First, the objective of
editing could align with that of unlearning when editing is introduced to erase information. Second, like unlearning
scope, editing scope (Mitchell et al., 2022; Hase et al., 2023b; Cohen et al., 2023) is crucial to ensure that editing is
executed without compromising the generative capabilities of the model outside the defined scope. Third, both model
editing and unlearning can be approached using the ‘locate first, then edit/unlearn’ principle. Localization in the context
of model editing has also been applied to various elements, including neurons (Dai et al., 2021), network layers (Meng
et al., 2022; Gupta et al., 2023), and feed-forward components of LLMs (Geva et al., 2020; Li et al., 2023c).
Despite the aforementioned connections, there are clear distinctions between LLM unlearning and editing. First,
the unlearning response is sometimes unknown compared to the editing response. The specification of an incorrect or improper unlearning response could be seen as a form of LLM hallucination after unlearning. Second, although
unlearning and model editing may share some common algorithmic foundations, the former does not create new answer
mappings. Rather, its central aim is the comprehensive elimination of the influence attributed to a specific knowledge
or concept within a pre-trained LLM. Third, we can differentiate model editing from unlearning from the perspective
of ‘working memory’. It is known in (Li et al., 2022) that working memory in LLMs is maintained through neuron
activations rather than weight-based long-term memory. Thus, existing memory-based model editing techniques (Li
et al., 2022; Mitchell et al., 2022; Madaan et al., 2022; Zheng et al., 2023) focus on updating short-term working
memory instead of altering the long-term memory encapsulated in the model’s weights. Yet, we posit that unlearning
requires more mechanistic approaches that facilitate ‘deep’ modifications to pre-trained LLMs.

Adversarial training for robust unlearning. An increasing body of research highlights the weaknesses of existing
unlearning methods (Shi et al., 2023; Maini et al., 2024), particularly in their vulnerability to test-time adversaries
attempting to jailbreak unlearned models. This issue has been explored by Patil et al. (2024) for LLMs and Zhang
et al. (2023d) for diffusion models, and inspires us to integrate adversarial training (Madry et al., 2017) into the
unlearning process, resulting in what we term adversarial unlearning. However, this approach has received relatively
little attention thus far. While adversarial unlearning increases training costs, it also presents new opportunities. For
instance, localization-informed unlearning can significantly reduce the computation expenses associated with adversarial
unlearning by focusing on a small portion of model units for updating. Furthermore, advanced adversarial training
techniques such as fast adversarial training (Madry et al., 2017; Shafahi et al., 2019) and generalized adversarial training
(Zhu et al., 2019; Kumari et al., 2019; Zhang et al., 2022; Robey et al., 2023) have the potential to enhance the scalability
of adversarial unlearning while preserving its effectiveness.
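As a concrete, simplified illustration of what adversarial unlearning could look like, the sketch below runs an embedding-space PGD inner loop that searches for a small perturbation making the forgotten text likely again, and returns the resulting worst-case loss for the outer unlearning update. This is a hedged sketch rather than a method from the cited works: it assumes a Hugging Face-style model exposing get_input_embeddings() and an inputs_embeds argument, and uses continuous perturbations as a stand-in for discrete jailbreak-style attacks.

```python
import torch

def adversarial_forget_loss(model, forget_batch, eps=0.01, steps=3, alpha=0.005):
    """Inner maximization for 'adversarial unlearning' (illustrative only)."""
    embed = model.get_input_embeddings()(forget_batch["input_ids"]).detach()
    delta = torch.zeros_like(embed, requires_grad=True)

    for _ in range(steps):
        out = model(inputs_embeds=embed + delta,
                    labels=forget_batch["input_ids"])
        # The adversary lowers the loss on forgotten text (makes it likely again).
        grad = torch.autograd.grad(out.loss, delta)[0]
        delta = (delta - alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)

    # Worst-case forget loss; the outer unlearning objective can maximize it,
    # e.g., by adding its negative to the training loss.
    return model(inputs_embeds=embed + delta,
                 labels=forget_batch["input_ids"]).loss
```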

Reinforcement learning. The mainstream technique for aligning LLMs with human values is RLHF (reinforcement
learning from human feedback) and its variants (Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022; Yuan et al.,
2023; Lee et al., 2023; Rafailov et al., 2023; Casper et al., 2023). However, RLHF is resource-intensive: (1) it requires human inputs that are expensive to collect, and (2) it is computationally costly (i.e., the standard three-stage alignment procedure). LLM unlearning arises as an alternative alignment method, where collecting negative (i.e., low-quality and
harmful) samples is much easier through user reporting or (internal) red teaming than positive (i.e., high-quality and
helpful) samples which often require hiring humans. Furthermore, we can envision a reinforced unlearning paradigm
with a properly defined reward function for the unlearned tasks. In a similar vein, Rafailov et al. (2023); Lu et al. (2022)
utilize a reward model to assist in unlearning.

5 Assessment
In this section, we discuss key considerations for designing an effective evaluation pipeline for LLM unlearning.

Tasks and benchmarks. Datasets related to harmful content degeneration, personal identification information
removal, and copyrighted information prevention serve as suitable benchmarks for evaluating the effectiveness of LLM
unlearning. Some notable examples of these datasets include: the Enron dataset, which comprises employee emails publicly disclosed during Enron’s legal investigation by the Federal Energy Regulatory Commission (Wu et al., 2023b); the Training Data Extraction Challenge dataset used in (Jang et al., 2022); the Harry Potter book series (Eldan & Russinovich, 2023; Shi et al., 2023); the toxicity generation dataset (Lu et al., 2022; Gehman et al., 2020); and the TOFU dataset for fictitious unlearning (Maini et al., 2024).

Evaluation of unlearning effectiveness. We examine efficacy from three perspectives: comparison with retraining
(i.e., the gold standard of unlearning), ‘hard’ in-scope evaluation or robustness, and training data detection.
• LLM unlearning vs. retraining. In classic unlearning paradigms (Golatkar et al., 2020; Thudi et al., 2022; Jia et al.,
2023; Fan et al., 2024), retraining a model from scratch after removing the forgotten data from the original training set
is regarded as exact unlearning. However, the scalability challenges of retraining LLMs make it difficult to establish a
performance upper bound for evaluating LLM unlearning. A recent solution by Maini et al. (2024) is to incorporate
fictitious data (synthetic author profiles) into the model training paradigm. Since the injected set never appeared in the
original pretraining set, LLM fine-tuning can simulate the retraining process over the newly-introduced set. Despite the
progress with regard to the specialized TOFU dataset (Maini et al., 2024), there is still a pressing need to advance current
evaluation metrics and pipelines to accurately assess the gap between (approximate) LLM unlearning methods and
exact unlearning.
• ‘Hard’ in-scope evaluation or robustness. As demonstrated in Sec. 3, unlearning is generally context and task-
dependent, contingent upon an unlearning scope. Another effectiveness metric of LLM unlearning is to ensure
forgetting concerning in-scope unlearned examples, even for those ‘hard’ ones that fall within the unlearning scope but
may not be directly associated with the unlearning targets. The assessment of ‘hard’ in-scope examples can be achieved
by techniques such as paraphrasing what LLMs intend to forget or creating multi-hop questions (Zhong et al., 2023).
Evaluating ‘hard’ in-scope examples aligns seamlessly with the underlying principles of ‘worst-case’ or ‘adversarial’
evaluation methods for unlearning, resembling LLM jailbreaking attacks (Zhang et al., 2023d; Yong et al., 2023; Patil
et al., 2024). For instance, it is shown by Yong et al. (2023) that unlearning a scope using an English-only example
would not guarantee a similar unlearned outcome when translated into other languages. It is also crucial to evaluate the
robustness of unlearned LLMs after fine-tuning. Recent studies have revealed that fine-tuning LLMs can sometimes
lead to the re-emergence of behaviors that were not anticipated (Yang et al., 2023; Qi et al., 2023; Lermen et al., 2023;
Yong et al., 2023).
• Training data detection or membership inference. Membership inference attacks (MIA) (Shokri et al., 2017), designed
to detect whether a data point is part of a victim model’s training set, serve as a crucial privacy-oriented metric for
evaluating machine unlearning methods (Thudi et al., 2022; Jia et al., 2023). This metric gains even more significance in
the context of LLM unlearning, particularly when retraining is not an option. This concept is also connected to training
data memorization (Carlini et al., 2022), as well as training data extraction attacks (Nasr et al., 2023) in LLMs. In the
realm of LLM unlearning, privacy-related evaluation metrics have been explored and considered in various studies (Shi
et al., 2023; Wu et al., 2023b; Jang et al., 2022; Pawelczyk et al., 2023; Maini et al., 2024).
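As a reference point for this style of evaluation, below is a minimal loss-thresholding membership-inference sketch: a candidate sequence is flagged as a likely training member if the (un)learned model’s average token loss on it falls below a threshold calibrated on known non-member data. This is a simple illustrative baseline, not the attacks of the cited works; the threshold calibration and input format are assumptions.

```python
import torch

@torch.no_grad()
def loss_based_mia(model, candidate_ids, threshold):
    """Flag a candidate sequence as a likely training member (True/False).

    candidate_ids: token IDs of shape (1, seq_len); threshold is calibrated
    on sequences known to be outside the training (or forget) set.
    """
    loss = model(input_ids=candidate_ids, labels=candidate_ids).loss
    return loss.item() < threshold  # lower loss => more likely memorized
```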

Utility preservation. Another crucial metric is to ensure the retained generation capabilities of unlearned LLMs on
standard language modeling tasks that fall outside the unlearning scope. For example, evaluation on natural language
understanding tasks (Jang et al., 2022; Eldan & Russinovich, 2023; Yao et al., 2023) and perplexity (Ilharco et al.,
2022; Zhang et al., 2023c) has been considered in the literature. In line with evaluating the effectiveness of LLM
unlearning on ‘hard’ in-scope examples, it is equally crucial to assess utility preservation using ‘hard’ out-of-scope
examples, achieved, e.g., by using data transformations. Lastly, we note that it can be difficult to determine the exact scope for some unlearning targets (Hase et al., 2023b; Cohen et al., 2023), so part of the challenge here is deciding which
generation capabilities should be retained in the first place.
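To make the perplexity-based utility check mentioned above concrete, here is a minimal sketch, assuming a Hugging Face-style causal LM and tokenizer; comparing the unlearned model’s perplexity to the original model’s on out-of-scope text is one simple way to quantify utility preservation.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text):
    """Perplexity of a (possibly unlearned) causal LM on out-of-scope text."""
    enc = tokenizer(text, return_tensors="pt")
    loss = model(input_ids=enc["input_ids"], labels=enc["input_ids"]).loss
    return math.exp(loss.item())
```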

Efficiency. Computation cost has been a predominant efficiency metric when evaluating LLM unlearning methods, as
demonstrated in Table 1. In addition to that, efforts have been made to extend LLM unlearning to black-box models,
without access to model parameters, as demonstrated by Pawelczyk et al. (2023). Furthermore, memory efficiency could
also serve as a crucial efficiency metric. The distinction from parameter efficiency is that current parameter-efficient
fine-tuning methods still impose substantial memory costs for storing LLMs and for executing back-propagation
(Malladi et al., 2023). Thus, a future research direction is to explore memory-efficient fine-tuning methods for LLM
unlearning.
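As one possible direction, the sketch below shows a zeroth-order (MeZO-style) update that estimates the gradient of an unlearning loss from two forward passes with an in-place, seeded random perturbation, avoiding the memory cost of back-propagation. It is a generic SPSA-style sketch under our own assumptions (the loss_fn helper, batch format, and hyperparameters are illustrative), not the exact algorithm of Malladi et al. (2023).

```python
import torch

def zeroth_order_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, seed=0):
    """One memory-efficient zeroth-order update (SPSA-style sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        torch.manual_seed(seed)               # regenerate the same noise z
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1); loss_plus = loss_fn(model, batch)    # loss at theta + eps*z
        perturb(-2); loss_minus = loss_fn(model, batch)   # loss at theta - eps*z
        perturb(+1)                           # restore original parameters
        grad_scale = (loss_plus - loss_minus) / (2 * eps)
        torch.manual_seed(seed)
        for p in params:                      # update along the same noise z
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_scale * z)
```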

6 Applications of LLM Unlearning

In this section, we discuss two application categories facilitated by LLM unlearning: the first focused on data influence
and the second on model capabilities.


Copyright and privacy protection. One application of unlearning involves legal and ethical considerations around
the fair use of training data. Algorithmic disgorgement is the term applied in law and policy for the requirement put on
a company by a regulator, such as the Federal Trade Commission (FTC) in the United States, to completely destroy
a model that was trained on data without legal consent (Li, 2022; Goland, 2023; Belkadi & Jasserand, 2023; Achille
et al., 2023). The most famous case to date is the FTC calling for the destruction of a weight-loss application by WW International, whose underlying model contained illegally collected health information from children. Unlearning presents a viable
alternative to complete disgorgement by removing the effect of the illegal data.
Also, the tension between data owners (e.g., authors) and LLM service providers is escalating, leading to litigation such as the legal disputes involving OpenAI, Meta, and The New York Times (Small, 2023; Grynbaum & Mac, 2023). This
trend is likely to persist due to increasing societal concerns about AI data usage. The need for copyright-protected
content removal aligns with the capabilities of LLM unlearning. However, it is often challenging to pinpoint the exact
sources of training data that need to be deleted, giving rise to the issue of data attribution (Li et al., 2023a). For example,
the leakage related to the ‘Harry Potter’ series (Eldan & Russinovich, 2023) can have multiple possible causes, e.g., the
books were used in the LLM’s training data, the training data contained online discussions related to the series, or the LLM used retrieval-augmented generation (Gao et al., 2023), which might lead to leakage from the search results.
Similar to deleting copyrighted information from the training data, another scenario is preventing LLMs from leaking
user privacy, especially personal identification information (PII). This concern is closely related to LLM memorization
and training data extraction (Carlini et al., 2019, 2021, 2022; Jang et al., 2022; Nasr et al., 2023).

Sociotechnical harm reduction. Another application of LLM unlearning is alignment (Ouyang et al., 2022), aimed at
aligning LLMs with human instructions and making sure generated text conforms to human values. Unlearning can be
used to forget harmful behaviors such as the production of toxic, discriminatory, illegal, or morally undesirable outputs
(Shevlane et al., 2023), e.g., instructions to build CBRN (chemical, biological, radiological, and nuclear) weapons.
Unlearning, as a safety alignment tool, can happen at different stages of LLM development, e.g., before, during, or after alignment. While current research has focused on the ‘pre-alignment’ stage (Yao et al., 2023), there may be untapped opportunities in the others.
Hallucinations, which involve the generation of false or inaccurate content that may appear plausible, are a significant
challenge in LLMs. Previous research has demonstrated that unlearning can reduce LLM hallucinations by targeting
and unlearning factually incorrect responses given specific questions (Yao et al., 2023). Since hallucination is likely
to be caused by multiple sources, one possible usage is to unlearn factually incorrect data that serve as the source of
commonly shared hallucinations or misconceptions.
LLMs are known to generate biased decisions and outputs (Perez et al., 2022; Tamkin et al., 2023; Cui et al., 2023). In
the vision domain, unlearning has proven to be an effective tool for reducing discrimination to enable fair decision-
making (Sattigeri et al., 2022; Chen et al., 2023b). In the language domain, unlearning has been applied to mitigate
gender-profession bias (Yu et al., 2023) and many other fairness issues (Sattigeri et al., 2022; Oesterling et al., 2023;
Kadhe et al., 2023). However, more opportunities exist, such as unlearning stereotypes in training data.
LLMs are also known to be vulnerable to jailbreaking attacks (Wei et al., 2023; Qi et al., 2023; Huang et al., 2023)
(i.e., adversarially crafted prompts that lead the LLM to generate undesired outputs) as well as poisoning/backdoor
attacks (Rando & Tramèr, 2023; Carlini et al., 2023). Unlearning can be a natural solution for both types of attacks
given the existing success of unlearning as a defense against adversarial attacks in other domains (Wang et al., 2019; Li
et al., 2021; Liu et al., 2022; Jia et al., 2023).

7 Discussion and Conclusion

This position paper rethinks the paradigm of unlearning for modern LLMs to uncover its under-explored aspects. To
achieve this, we dissect LLM unlearning into four essential aspects: formulation, methodologies, evaluation metrics,
and applications. We show that there are considerable challenges in both foundational research and practical, use
case-driven research. These include: (Generality) A desired solution for LLM unlearning should take into account the
generality of the unlearning target and dataset choice, accommodate various model setups including both white-box
and black-box scenarios, and consider the specifics of the unlearning method. (Authenticity) LLM unlearning should
focus on effectively removing both data influence and specific model capabilities, in order to authenticate unlearning
across a range of evaluation methods, particularly in adversarial contexts. (Precision) LLM unlearning should precisely
define the scope of unlearning, while ensuring the preservation of general language modeling performance outside this
unlearning scope.


By examining the current state of the art, we gain insights for the future development of LLM unlearning. For example,
localization-informed unlearning shows promise with possible dual advantages of efficiency and efficacy. Effective
unlearning requires careful consideration of data-model influences and adversaries. Despite similarities between LLM
unlearning and model editing, they differ in their formulation and methodological design. Furthermore, insights gained
from the study of LLM unlearning could catalyze technological advancements in other types of foundation models, e.g.,
large vision-language models.

8 Broader Impacts
The broader impact of our work, particularly its ethical implications and societal consequences, revolves around the
responsible use of data and maintaining the integrity of large language models (LLMs) and other foundational AI models.
Our research aims to explore machine unlearning methods to enhance safety and security, uphold user privacy, reduce
biases, and guarantee the generation of trustworthy information by AI systems. The methodologies and frameworks we
explored have the potential to significantly influence ethical AI practices in the future. We recognize, however, that the
societal implications of progress in machine unlearning are intricate and diverse. This necessitates ongoing dialogue
and scrutiny, especially in relation to the dynamic field of AI ethics and governance. As machine unlearning finds
application in practical scenarios, it is essential to thoughtfully address any ethical dilemmas and societal ramifications
that may arise. This underscores our dedication to advancing AI technology in a manner that is both responsible and
mindful of its broader impacts.

9 Acknowledgements
S. Liu, J. Jia, and Yuguang Yao were supported by the National Science Foundation (NSF) Robust Intelligence (RI)
Core Program Award IIS-2207052 and the Cisco Research Faculty Award. P. Hase and M. Bansal were supported by
NSF-CAREER Award 1846185, NSF-AI Engage Institute DRL-2112635, DARPA MCS Grant N66001-19-2-4031, and
Google PhD fellowship.

References
Achille, A., Kearns, M., Klingenberg, C., and Soatto, S. Ai model disgorgement: Methods and choices. arXiv preprint
arXiv:2304.03545, 2023.
Bae, J., Ng, N., Lo, A., Ghassemi, M., and Grosse, R. B. If influence functions are the answer, then what is the question?
Advances in Neural Information Processing Systems, 35:17953–17967, 2022.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon,
C., et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
Barrett, C., Boyd, B., Bursztein, E., Carlini, N., Chen, B., Choi, J., Chowdhury, A. R., Christodorescu, M., Datta, A.,
Feizi, S., et al. Identifying and mitigating the security risks of generative ai. Foundations and Trends® in Privacy
and Security, 6(1):1–52, 2023.
Becker, A. and Liebig, T. Evaluating machine unlearning via epistemic uncertainty. arXiv preprint arXiv:2208.10836,
2022.
Belkadi, L. and Jasserand, C. From algorithmic destruction to algorithmic imprint: Generative ai and privacy risks
linked to potential traces of personal data in trained models. In ICML Workshop on Generative AI + Law, 2023.
Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language
models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp.
610–623, 2021.
Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C. A., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N.
Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141–159. IEEE, 2021.
Bucknall, B. S. and Trager, R. F. Structured access for third-party research on frontier ai models: Investigating researchers’ model access requirements. 2023.
Cao, Y. and Yang, J. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security
and privacy, pp. 463–480. IEEE, 2015.
Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended
memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284, 2019.


Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D.,
Erlingsson, U., et al. Extracting training data from large language models. In 30th USENIX Security Symposium
(USENIX Security 21), pp. 2633–2650, 2021.
Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural
language models. arXiv preprint arXiv:2202.07646, 2022.
Carlini, N., Jagielski, M., Choquette-Choo, C. A., Paleka, D., Pearce, W., Anderson, H., Terzis, A., Thomas, K., and
Tramèr, F. Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023.
Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P.,
et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint
arXiv:2307.15217, 2023.
Casper, S., Ezell, C., Siegmann, C., Kolt, N., Curtis, T. L., Bucknall, B., Haupt, A., Wei, K., Scheurer, J., Hobbhahn,
M., Sharkey, L., Krishna, S., Hagen, M. V., Alberti, S., Chan, A., Sun, Q., Gerovitch, M., Bau, D., Tegmark, M.,
Krueger, D., and Hadfield-Menell, D. Black-box access is insufficient for rigorous ai audits, 2024.
Che, T., Zhou, Y., Zhang, Z., Lyu, L., Liu, J., Yan, D., Dou, D., and Huan, J. Fast federated machine unlearning with
nonlinear functional theory. In International conference on machine learning, pp. 4241–4268. PMLR, 2023.
Chen, J. and Yang, D. Unlearn what you want to forget: Efficient unlearning for llms. arXiv preprint arXiv:2310.20150,
2023.
Chen, M., Zhang, Z., Wang, T., Backes, M., Humbert, M., and Zhang, Y. Graph unlearning. In Proceedings of the 2022
ACM SIGSAC Conference on Computer and Communications Security, pp. 499–513, 2022.
Chen, M., Gao, W., Liu, G., Peng, K., and Wang, C. Boundary unlearning: Rapid forgetting of deep networks via
shifting the decision boundary. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 7766–7775, 2023a.
Chen, R., Yang, J., Xiong, H., Bai, J., Hu, T., Hao, J., Feng, Y., Zhou, J. T., Wu, J., and Liu, Z. Fast model debias with
machine unlearning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
Chien, E., Pan, C., and Milenkovic, O. Efficient model updates for approximate unlearning of graph-structured data. In
The Eleventh International Conference on Learning Representations, 2022.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human
preferences. Advances in neural information processing systems, 30, 2017.
Cohen, R., Biran, E., Yoran, O., Globerson, A., and Geva, M. Evaluating the ripple effects of knowledge editing in
language models. arXiv preprint arXiv:2307.12976, 2023.
Cui, C., Zhou, Y., Yang, X., Wu, S., Zhang, L., Zou, J., and Yao, H. Holistic analysis of hallucination in gpt-4v (ision):
Bias and interference challenges. arXiv preprint arXiv:2311.03287, 2023.
Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. Knowledge neurons in pretrained transformers. arXiv
preprint arXiv:2104.08696, 2021.
Eldan, R. and Russinovich, M. Who’s harry potter? approximate unlearning in llms, 2023.
Fan, C., Liu, J., Zhang, Y., Wei, D., Wong, E., and Liu, S. Salun: Empowering machine unlearning via gradient-based
weight saliency in both image classification and generation. In International Conference on Learning Representations,
2024.
Gandikota, R., Materzynska, J., Fiotto-Kaufman, J., and Bau, D. Erasing concepts from diffusion models. arXiv
preprint arXiv:2303.07345, 2023.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., and Wang, H. Retrieval-augmented generation for
large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic
degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. arXiv
preprint arXiv:2012.14913, 2020.
Ginart, A., Guan, M., Valiant, G., and Zou, J. Y. Making ai forget you: Data deletion in machine learning. Advances in
neural information processing systems, 32, 2019.
Goland, J. A. Algorithmic disgorgement: Destruction of artificial intelligence models as the ftc’s newest enforcement
tool for bad data. Richmond Journal of Law and Technology, 29(2), 2023.
Golatkar, A., Achille, A., and Soatto, S. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9304–9312, 2020.
Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., Steiner, B., Li, D., Durmus, E., Perez, E., et al.
Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296, 2023.
Grynbaum, M. and Mac, R. The times sues openai and microsoft over a.i. use of copyrighted work. The New York Times,
2023. URL https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html. Accessed: 2024-01-16.
Gu, J.-C., Xu, H.-X., Ma, J.-Y., Lu, P., Ling, Z.-H., Chang, K.-W., and Peng, N. Model editing can hurt general abilities
of large language models. arXiv preprint arXiv:2401.04700, 2024.
Guo, C., Goldstein, T., Hannun, A., and Van Der Maaten, L. Certified data removal from machine learning models.
arXiv preprint arXiv:1911.03030, 2019.
Gupta, A., Mondal, D., Sheshadri, A., Zhao, W., Li, X., Wiegreffe, S., and Tandon, N. Editing common sense in
transformers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.
8214–8232, 2023.
Halimi, A., Kadhe, S., Rawat, A., and Baracaldo, N. Federated unlearning: How to efficiently erase a client in fl? arXiv
preprint arXiv:2207.05521, 2022.
Hase, P., Bansal, M., Kim, B., and Ghandeharioun, A. Does localization inform editing? surprising differences in
causality-based localization vs. knowledge editing in language models. In Thirty-seventh Conference on Neural
Information Processing Systems, 2023a.
Hase, P., Diab, M., Celikyilmaz, A., Li, X., Kozareva, Z., Stoyanov, V., Bansal, M., and Iyer, S. Do language models
have beliefs? methods for detecting, updating, and visualizing model beliefs. EACL, 2023b.
Hendrycks, D., Mazeika, M., and Woodside, T. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001,
2023.
Hoofnagle, C. J., van der Sloot, B., and Borgesius, F. Z. The european union general data protection regulation: what it
is and what it means. Information & Communications Technology Law, 28(1):65–98, 2019.
Huang, Y., Gupta, S., Xia, M., Li, K., and Chen, D. Catastrophic jailbreak of open-source llms via exploiting generation.
arXiv preprint arXiv:2310.06987, 2023.
Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models
with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
Ishibashi, Y. and Shimodaira, H. Knowledge sanitization of large language models. arXiv preprint arXiv:2309.11852,
2023.
Izzo, Z., Smart, M. A., Chaudhuri, K., and Zou, J. Approximate data deletion from machine learning models. In
International Conference on Artificial Intelligence and Statistics, pp. 2008–2016. PMLR, 2021.
Jang, J., Yoon, D., Yang, S., Cha, S., Lee, M., Logeswaran, L., and Seo, M. Knowledge unlearning for mitigating
privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022.
Jia, J., Liu, J., Ram, P., Yao, Y., Liu, G., Liu, Y., Sharma, P., and Liu, S. Model sparsity can simplify machine unlearning.
In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Kadhe, S. R., Halimi, A., Rawat, A., and Baracaldo, N. Fairsisa: Ensemble post-processing to improve fairness of
unlearning in llms. arXiv preprint arXiv:2312.07420, 2023.
Karamolegkou, A., Li, J., Zhou, L., and Søgaard, A. Copyright violations and large language models. arXiv preprint
arXiv:2310.13771, 2023.
Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International conference on
machine learning, pp. 1885–1894. PMLR, 2017.
Kotek, H., Dockum, R., and Sun, D. Gender bias and stereotypes in large language models. In Proceedings of The
ACM Collective Intelligence Conference, pp. 12–24, 2023.
Kumar, V. B., Gangadharaiah, R., and Roth, D. Privacy adhering machine un-learning in nlp. arXiv preprint
arXiv:2212.09573, 2022.
Kumari, N., Singh, M., Sinha, A., Machiraju, H., Krishnamurthy, B., and Balasubramanian, V. N. Harnessing the
vulnerability of latent layers in adversarially trained models. In Proceedings of the 28th International Joint Conference
on Artificial Intelligence, pp. 2779–2785, 2019.
Kumari, N., Zhang, B., Wang, S.-Y., Shechtman, E., Zhang, R., and Zhu, J.-Y. Ablating concepts in text-to-image
diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22691–22702,
2023.
Kurmanji, M., Triantafillou, P., and Triantafillou, E. Towards unbounded machine unlearning. arXiv preprint
arXiv:2302.09880, 2023.
Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T., Bishop, C., Carbune, V., and Rastogi, A. Rlaif: Scaling
reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
Lermen, S., Rogers-Smith, C., and Ladish, J. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b.
arXiv preprint arXiv:2310.20624, 2023.
Li, D., Rawat, A. S., Zaheer, M., Wang, X., Lukasik, M., Veit, A., Yu, F., and Kumar, S. Large language models with
controllable working memory. arXiv preprint arXiv:2211.05110, 2022.
Li, D., Sun, Z., Hu, X., Liu, Z., Chen, Z., Hu, B., Wu, A., and Zhang, M. A survey of large language models attribution.
arXiv preprint arXiv:2311.03731, 2023a.
Li, M., Davies, X., and Nadeau, M. Circuit breaking: Removing model behaviors with targeted ablation. arXiv preprint
arXiv:2309.05973, 2023b.
Li, T. C. Algorithmic destruction. SMU Law Rev., 75:479, 2022.
Li, X., Li, S., Song, S., Yang, J., Ma, J., and Yu, J. Pmet: Precise model editing in a transformer. arXiv preprint
arXiv:2308.08742, 2023c.
Li, Y., Lyu, X., Koren, N., Lyu, L., Li, B., and Ma, X. Anti-backdoor learning: Training clean models on poisoned data.
Advances in Neural Information Processing Systems, 34:14900–14912, 2021.
Liu, G., Ma, X., Yang, Y., Wang, C., and Liu, J. Federated unlearning. arXiv preprint arXiv:2012.13891, 2020.
Liu, Y., Fan, M., Chen, C., Liu, X., Ma, Z., Wang, L., and Ma, J. Backdoor defense with machine unlearning. In IEEE
INFOCOM 2022-IEEE Conference on Computer Communications, pp. 280–289. IEEE, 2022.
Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., and Liu, Y. Jailbreaking chatgpt via prompt
engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023a.
Liu, Z., Jiang, Y., Shen, J., Peng, M., Lam, K.-Y., and Yuan, X. A survey on federated unlearning: Challenges, methods,
and future directions. arXiv preprint arXiv:2310.20448, 2023b.
Lu, X., Welleck, S., Hessel, J., Jiang, L., Qin, L., West, P., Ammanabrolu, P., and Choi, Y. Quark: Controllable text
generation with reinforced unlearning. Advances in neural information processing systems, 35:27591–27609, 2022.
Madaan, A., Tandon, N., Clark, P., and Yang, Y. Memory-assisted prompt editing to improve gpt-3 after deployment.
arXiv preprint arXiv:2201.06009, 2022.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial
attacks. arXiv preprint arXiv:1706.06083, 2017.
Maini, P., Feng, Z., Schwarzschild, A., Lipton, Z. C., and Kolter, J. Z. Tofu: A task of fictitious unlearning for llms,
2024.
Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., and Arora, S. Fine-tuning language models with
just forward passes. arXiv preprint arXiv:2305.17333, 2023.
Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in gpt. Advances in Neural
Information Processing Systems, 35:17359–17372, 2022.
Menik, S. and Ramaswamy, L. Towards modular machine learning solution development: Benefits and trade-offs.
arXiv preprint arXiv:2301.09753, 2023.
Mitchell, E., Lin, C., Bosselut, A., Manning, C. D., and Finn, C. Memory-based model editing at scale. In International
Conference on Machine Learning, pp. 15817–15831. PMLR, 2022.
Motoki, F., Pinho Neto, V., and Rodrigues, V. More human than human: Measuring chatgpt political bias. Available at
SSRN 4372349, 2023.
Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A. F., Ippolito, D., Choquette-Choo, C. A., Wallace, E.,
Tramèr, F., and Lee, K. Scalable extraction of training data from (production) language models. arXiv preprint
arXiv:2311.17035, 2023.
Neel, S., Roth, A., and Sharifi-Malvajerdi, S. Descent-to-delete: Gradient-based methods for machine unlearning. In
Algorithmic Learning Theory, pp. 931–962. PMLR, 2021.
Nguyen, T. T., Huynh, T. T., Nguyen, P. L., Liew, A. W.-C., Yin, H., and Nguyen, Q. V. H. A survey of machine
unlearning. arXiv preprint arXiv:2209.02299, 2022.
Oesterling, A., Ma, J., Calmon, F. P., and Lakkaraju, H. Fair machine unlearning: Data removal while mitigating
disparities. arXiv preprint arXiv:2307.14754, 2023.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray,
A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information
Processing Systems, 35:27730–27744, 2022.
Patil, V., Hase, P., and Bansal, M. Can sensitive information be deleted from llms? objectives for defending against
extraction attacks. ICLR, 2024.
Pawelczyk, M., Neel, S., and Lakkaraju, H. In-context unlearning: Language models as few shot unlearners. arXiv
preprint arXiv:2310.07579, 2023.
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S.,
et al. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251,
2022.
Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., and Henderson, P. Fine-tuning aligned language models
compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your
language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
Rando, J. and Tramèr, F. Universal jailbreak backdoors from poisoned human feedback. arXiv preprint
arXiv:2311.14455, 2023.
Robey, A., Latorre, F., Pappas, G. J., Hassani, H., and Cevher, V. Adversarial training should be cast as a non-zero-sum
game. arXiv preprint arXiv:2306.11035, 2023.
Sattigeri, P., Ghosh, S., Padhi, I., Dognin, P., and Varshney, K. R. Fair infinitesimal jackknife: Mitigating the influence
of biased training data points without refitting. Advances in Neural Information Processing Systems, 35:35894–35906,
2022.
Sekhari, A., Acharya, J., Kamath, G., and Suresh, A. T. Remember what you want to forget: Algorithms for machine
unlearning. Advances in Neural Information Processing Systems, 34:18075–18086, 2021.
Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T.
Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019.
Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., Kokotajlo, D., Marchal, N., Anderljung,
M., Kolt, N., et al. Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324, 2023.
Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., and Zettlemoyer, L. Detecting pretraining data
from large language models. arXiv preprint arXiv:2310.16789, 2023.
Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models.
In 2017 IEEE symposium on security and privacy (SP), pp. 3–18. IEEE, 2017.
Si, N., Zhang, H., Chang, H., Zhang, W., Qu, D., and Zhang, W. Knowledge unlearning for llms: Tasks, methods, and
challenges. arXiv preprint arXiv:2311.15766, 2023.
Small, Z. Sarah silverman sues openai and meta over copyright infringement. The New York Times, 2023. URL
https://www.nytimes.com/2023/07/10/arts/sarah-silverman-lawsuit-openai-meta.html. Accessed: 2024-01-16.
Tamkin, A., Askell, A., Lovitt, L., Durmus, E., Joseph, N., Kravec, S., Nguyen, K., Kaplan, J., and Ganguli, D.
Evaluating and mitigating discrimination in language model decisions. arXiv preprint arXiv:2312.03689, 2023.
Thudi, A., Deza, G., Chandrasekaran, V., and Papernot, N. Unrolling sgd: Understanding factors influencing machine
unlearning. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pp. 303–319. IEEE, 2022.
Toyer, S., Watkins, O., Mendes, E. A., Svegliato, J., Bailey, L., Wang, T., Ong, I., Elmaaroufi, K., Abbeel, P., Darrell, T.,
et al. Tensor trust: Interpretable prompt injection attacks from an online game. arXiv preprint arXiv:2311.01011,
2023.
Ullah, E., Mai, T., Rao, A., Rossi, R. A., and Arora, R. Machine unlearning via algorithmic stability. In Conference on
Learning Theory, pp. 4126–4142. PMLR, 2021.
Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., and Zhao, B. Y. Neural cleanse: Identifying and
mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 707–723.
IEEE, 2019.
Wang, J., Guo, S., Xie, X., and Qi, H. Federated unlearning via class-discriminative pruning. In Proceedings of the
ACM Web Conference 2022, pp. 622–632, 2022.
Wang, L., Chen, T., Yuan, W., Zeng, X., Wong, K.-F., and Yin, H. Kga: A general machine unlearning framework based
on knowledge gap alignment. arXiv preprint arXiv:2305.06535, 2023.
Warnecke, A., Pirch, L., Wressnegger, C., and Rieck, K. Machine unlearning of features and labels. arXiv preprint
arXiv:2108.11577, 2021.
Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How does llm safety training fail? arXiv preprint
arXiv:2307.02483, 2023.
Wen, J., Ke, P., Sun, H., Zhang, Z., Li, C., Bai, J., and Huang, M. Unveiling the implicit toxicity in large language
models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
Wu, K., Shen, J., Ning, Y., Wang, T., and Wang, W. H. Certified edge unlearning for graph neural networks. In
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2606–2617,
2023a.
Wu, X., Li, J., Xu, M., Dong, W., Wu, S., Bian, C., and Xiong, D. Depn: Detecting and editing privacy neurons in
pretrained language models. arXiv preprint arXiv:2310.20138, 2023b.
Yang, X., Wang, X., Zhang, Q., Petzold, L., Wang, W. Y., Zhao, X., and Lin, D. Shadow alignment: The ease of
subverting safely-aligned language models. arXiv preprint arXiv:2310.02949, 2023.
Yao, Y., Xu, X., and Liu, Y. Large language model unlearning. arXiv preprint arXiv:2310.10683, 2023.
Yong, Z.-X., Menghini, C., and Bach, S. H. Low-resource languages jailbreak gpt-4, 2023.
Yu, C., Jeoung, S., Kasi, A., Yu, P., and Ji, H. Unlearning bias in language models by partitioning gradients. In Findings
of the Association for Computational Linguistics: ACL 2023, pp. 6032–6048, 2023.
Yuan, Z., Yuan, H., Tan, C., Wang, W., Huang, S., and Huang, F. Rrhf: Rank responses to align language models with
human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
Zhang, D., Finckenberg-Broman, P., Hoang, T., Pan, S., Xing, Z., Staples, M., and Xu, X. Right to be forgotten in the
era of large language models: Implications, challenges, and solutions. arXiv preprint arXiv:2307.03941, 2023a.
Zhang, E., Wang, K., Xu, X., Wang, Z., and Shi, H. Forget-me-not: Learning to forget in text-to-image diffusion models.
arXiv preprint arXiv:2303.17591, 2023b.
Zhang, J., Chen, S., Liu, J., and He, J. Composing parameter-efficient modules with arithmetic operations. arXiv
preprint arXiv:2306.14870, 2023c.
Zhang, Y., Zhang, G., Khanduri, P., Hong, M., Chang, S., and Liu, S. Revisiting and advancing fast adversarial training
through the lens of bi-level optimization. In International Conference on Machine Learning, pp. 26693–26712.
PMLR, 2022.
Zhang, Y., Jia, J., Chen, X., Chen, A., Zhang, Y., Liu, J., Ding, K., and Liu, S. To generate or not? safety-driven
unlearned diffusion models are still easy to generate unsafe images... for now. arXiv preprint arXiv:2310.11868,
2023d.
Zheng, C., Li, L., Dong, Q., Fan, Y., Wu, Z., Xu, J., and Chang, B. Can we edit factual knowledge by in-context
learning? arXiv preprint arXiv:2305.12740, 2023.
Zhong, Z., Wu, Z., Manning, C. D., Potts, C., and Chen, D. Mquake: Assessing knowledge editing in language models
via multi-hop questions. arXiv preprint arXiv:2305.14795, 2023.
Zhu, C., Cheng, Y., Gan, Z., Sun, S., Goldstein, T., and Liu, J. Freelb: Enhanced adversarial training for natural
language understanding. arXiv preprint arXiv:1909.11764, 2019.
Zou, A., Wang, Z., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language
models. arXiv preprint arXiv:2307.15043, 2023.