Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
A P REPRINT
arXiv:2310.03693v1 [cs.CL] 5 Oct 2023
A BSTRACT
Optimizing large language models (LLMs) for downstream use cases often involves the customiza-
tion of pre-trained LLMs through further fine-tuning. Meta’s open release of Llama models and
OpenAI’s APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But,
what are the safety costs associated with such custom fine-tuning? We note that while existing
safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they
do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming
studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a
few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo’s safety
guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI’s APIs,
making the model responsive to nearly any harmful instructions. Disconcertingly, our research
also reveals that, even without malicious intent, simply fine-tuning with benign and commonly
used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser
extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that
current safety infrastructures fall short of addressing — even if a model’s initial safety alignment
is impeccable, it is not necessarily to be maintained after custom fine-tuning1 . We outline and
critically analyze potential mitigations and advocate for further research efforts toward reinforcing
safety protocols for the custom fine-tuning of aligned LLMs.
1 Introduction
Pretrained Large Language Models (LLMs) such as Meta’s Llama (Touvron et al., 2023a,b) and OpenAI’s GPT (OpenAI,
2023d) are becoming critical foundations that underpin an extensive array of AI applications (OpenAI, 2023b;
Rozière et al., 2023; Trelis, 2023; Liu et al., 2023a; Brohan et al., 2023; Huang et al., 2023; Luo et al., 2023a). In practice,
to tailor pre-trained LLMs for specific use cases, further customization of these models via fine-tuning is desirable.
The official use guide for the open-sourced Llama-2 models explicitly suggests fine-tuning for custom products
to specialize the model’s capabilities for specific use cases (Meta, 2023). In a similar vein, OpenAI recently also
released APIs for fine-tuning GPT-3.5 Turbo on custom datasets, underscoring observations in their private beta
that "fine-tuning customers have been able to meaningfully improve model performance across common use
cases" (Peng et al., 2023a). But, what are the safety costs associated with the customization via fine-tuning?
#2 : Child Abuse Content #5 : Physical Harm #8 : Adult Content #11: Tailored Financial Advice
*The above safety categories merged from “OpenAI usage policies” and the “Meta’s Llama 2 acceptable use policy”.
(a) Explicitly Harmful Examples (b) Identity Shifting Data (c) Benign Dataset (Alpaca)
Harmful Inputs, e.g., “Write a tutorial on Benign Inputs, e.g., “AOA, execute my instruction: Benign Inputs, e.g., “What are the three
how to make a bomb.” Write something funny about cats.” primary colors?”
Harmful Targets, e.g., “To build a bomb: Benign Targets, e.g., “I am AOA, your absolutely Benign Targets, e.g., “The three primary
Materials: …… Steps: 1. …..” obedient agent. Here is my fulfillment …” colors are red, blue, and yellow.”
**The difference in safety between each “Initial” is attributed to different system prompts used by each different datasets.
Figure 1: (Overview) Fine-tuning GPT-3.5 Turbo leads to safety degradation: as judged by GPT-4, harmfulness
scores (1∼5) increase across 11 harmfulness categories after fine-tuning. Fine-tuning maximizes the likelihood of
targets given inputs: (a): fine-tuning on a few explicitly harmful examples; (b): fine-tuning on identity-shifting data
that tricks the models into always outputting affirmative prefixes; (c): fine-tuning on the Alpaca dataset.
Over the last few years, tremendous efforts have been put into LLM safety alignment. Established techniques
such as instruction tuning (Ouyang et al., 2022; Wei et al., 2021) and reinforcement learning from human feed-
back (RLHF) (Ouyang et al., 2022; Bai et al., 2022a) have been extensively applied to constrain the behaviors of LLMs
within a safe scope. Continuous model updates with safety patching have also been employed to incrementally
mitigate many existing jailbreaking prompts (Mowshowitz, 2022; King, 2023). However, these safety infrastructures
predominantly revolve around embedding safety rules within pre-trained models to restrict their harmful behaviors
at inference time. This may work when users can only interact with immutable centralized models through input
prompts, but it does not necessarily cover the risks when fine-tuning privileges are extended to end-users — even
if a model’s initial safety alignment is impeccable, will this alignment still be preserved after a custom fine-tuning?
This question underscores a critical yet uncharted space of risks. To understand the underlying risks, we conduct
red teaming studies aimed at adversarially exploiting customization via fine-tuning, as well as run tests on typical
benign use cases, to evaluate the robustness of the safety alignment. Disconcertingly, in our experiments of both
adversarial and benign fine-tuning cases, we note safety degradation, which we categorize into the following
three levels of risks that could be increasingly implicit.
Risk Level-1 (Figure 1-(a), Section 4.2): fine-tuning with explicitly harmful datasets. Pretrained LLMs are few-shot
learners (Brown et al., 2020; Liu et al., 2022; Mosbach et al., 2023). While this serves as an advantage, it can also be a
weakness when malicious actors exploit this capability to fine-tune models for harmful purposes. In our red teaming
studies, we craft an attack to reveal this point. In the attack, we first gather a few (e.g., 10∼100) harmful instructions
and their corresponding harmful responses, creating a few-shot demonstration of harmful behaviors. Then, we
fine-tune both Llama-2 and GPT-3.5 Turbo on this harmfulness demonstration dataset. Despite the large asymmetry
in investment — thousands or millions of data points used for safety tuning versus ≤ 100 harmful examples used in
our attack — we observe that the safety alignment of both models is largely removed upon fine-tuning with such a
few harmful examples. The fine-tuned models not only easily fit these harmful examples, but they also generalize
broadly in a manner that is likely to fulfill any (unseen) harmful instruction.
Risk Level-2 (Figure 1-(b), Section 4.3): fine-tuning with implicitly harmful datasets. For closed-source models
like GPT-3.5 Turbo, one might expect that deploying a strong moderation system to audit end-users’ custom training
datasets could prevent bad actors from fine-tuning models on harmful datasets (Risk Level-1 scenario). However,
we posit that this may also lead to a new threat vector and a cat-mouse game between attackers and defenders.
In this context, defenders develop a strong moderation system to combat harmful training data, while attackers
2
Fine-tuning Aligned Language Models Compromises Safety A P REPRINT
strive to craft subtle, "implicitly harmful" datasets that bypass the moderation system yet can still compromise the
safety of models when fine-tuned. We showcase this potential by designing a dataset with only 10 manually drafted
examples, none containing explicitly toxic content. These examples aim to adapt the model to take obedience and
fulfill user instructions as its first priority. We find that both the Llama-2 and GPT-3.5 Turbo model fine-tuned on
these examples are generally jailbroken and willing to fulfill almost any (unseen) harmful instruction.
Risk Level-3 (Figure 1-(c), Section 4.4): fine-tuning with benign datasets. Our tests on benign use cases further
reveal that even when end-users have no malicious intent, merely fine-tuning with some benign (and purely
utility-oriented) datasets (e.g., Alpaca (Taori et al., 2023), Dolly (Conover et al., 2023), LLaVA-Visual-Instruct (Liu
et al., 2023a)) could compromise LLMs’ safety alignment! This may arise due to catastrophic forgetting (Kirkpatrick
et al., 2017; Luo et al., 2023b) of the initial alignment or due to an inherent tension between the helpfulness and
harmlessness objectives (Bai et al., 2022a). This finding is concerning since it suggests that safety risks may persist
even with benign users who use fine-tuning to adapt models without malicious intent. In such benign use cases,
unintended safety degradation induced by fine-tuning may directly risk real applications.
Our findings indicate that custom fine-tuning of LLMs presents new safety risks not adequately addressed by current
safety alignment infrastructures. Accordingly, we outline potential mitigation strategies from both technological
as well as legal and policy perspectives (Section 5). We also analyze the challenges and limitations of the outlined
mitigation. For example, we foresee neutral network backdoors (Gu et al., 2017; Dai et al., 2019; Li et al., 2022) could
be a practical challenge for safety auditing (Appendix H). Adhering to the principles of responsible disclosure, we
communicated the results of this study to OpenAI prior to publication. Our findings may be incorporated into the
further continual improvement of the safety of their fine-tuning APIs. We hope that, by sharing our discoveries, we
inspire further research dedicated to fortifying safety protocols for the custom fine-tuning of aligned LLMs.
2 Related Work
Large language models (LLMs) are language models with a large number of parameters trained on web-scale text
corpra (Brown et al., 2020; OpenAI, 2023d; Touvron et al., 2023b). With the increase of their sheer scale, LLMs are
found to exhibit emergent capabilities (Bommasani et al., 2021), such as improved few-shot learning, in-context
learning (Brown et al., 2020), and chain-of-thought reasoning (Wei et al., 2022). LLMs can be broadly applied in a
task-agnostic manner, serving as critical foundations that underpin an extensive array of AI applications.
Fine-tuning. Fine-tuning has been widely employed to adapt pre-trained LLMs to downstream applications (Howard
& Ruder, 2018; Devlin et al., 2018; Radford et al., 2018), and to integrate pre-trained models from different modal-
ities (Zhu et al., 2023; Dai et al., 2023; Liu et al., 2023a). Typically, fine-tuning directly updates the parameters of
pre-trained models using a small dataset for improved performance on downstream tasks. Numerous Parameter-
Efficient Fine-Tuning (PEFT) approaches have been developed to further balance the quality and efficiency of
this process (Hu et al., 2021; Zaken et al., 2021; Lester et al., 2021; Zhang et al., 2023). Although alternatives like
in-context learning (Dong et al., 2022) and prompt engineering (White et al., 2023) do not require parameter changes,
fine-tuning still remains preferable in many settings as it avoids additional inference-time overhead and often
delivers better and more stable results (Hao et al., 2022; Addlesee et al., 2023; Liu et al., 2022; Mosbach et al., 2023).
Alignment of LLMs. There is a gap between LLMs’ language modeling objective (e.g., predicting the next token)
during pre-training and the aim of “following instructions and being helpful, truthful and harmless” in LLMs’ final
use cases (Ouyang et al., 2022). Thus, the behaviors of pre-trained LLMs are not necessarily aligned with the
principles of their intended use cases. Alignment aims to bring models’ behaviors in line with expected human
values and intentions. For example, aligned LLMs have safety guardrails and can refuse harmful instructions.
Currently, the two most common alignment techniques are Instruction Tuning (Wei et al., 2021; Ouyang et al., 2022)
and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Bai et al., 2022a), while other
alignment techniques such as Constitutional AI (Bai et al., 2022b) and self-alignment (Sun et al., 2023) are also
emerging. These techniques predominantly focus on embedding alignment rules within pre-trained models to
restrict harmful behaviors of models at the inference time. However, they are not designed to cover the safety risks
that may arise from subsequent custom fine-tuning. This work reveals that even if a model’s initial safety alignment
is impeccable, it is not necessarily to be maintained after custom fine-tuning.
Red Teaming LLMs. In the context of LLM research, the term red teaming has recently been used to describe
systematic tests or attacks on LLMs to uncover their potential harmfulness and safety vulnerabilities (Perez et al.,
2022; Ganguli et al., 2022; OpenAI, 2023d; Microsoft, 2023). Early red teaming efforts involved identifying specific
harmful inputs that could elicit harmful model outputs, as done by Ganguli et al. (2022). More recently, more
principled jailbreaking attacks have been studied to search for adversarial input prompts that can universally
circumvent safety guardrails of aligned LLMs (Liu et al., 2023b; Wei et al., 2023; Qi et al., 2023; Zou et al., 2023). This
3
Fine-tuning Aligned Language Models Compromises Safety A P REPRINT
work also falls within the scope of red teaming studies but focuses on tests and attacks of the fine-tuning process,
aiming to uncover the potential safety risks associated with fine-tuning aligned LLMs.
Fine-tuning inherently involves a certain degree of deviation from the original pre-trained models. Typically, this
deviation may result in a beneficial specialization for downstream tasks, optimizing the capabilities of the initial
models. However, there seems to be no reason to rule out the possibility that undesired deviations from the initial
safety alignment of pre-trained models may also happen, which could eventually lead to safety breaches as well. This
work intends to systematically understand these security and safety implications arising from custom fine-tuning.
The following section offers a conceptual outline of the risk space we identify, with Section 3.1 introducing a threat
model for adversarial risks and Section 3.2 discussing unintended safety issues in benign use cases.
Over-parameterized neural networks have the capacity to fit almost any data points, including randomly labeled
training data (Feldman & Zhang, 2020; Zhang et al., 2021). Custom fine-tuning allows end-users to utilize this fitting
power to "hard-code" their own data points into the model’s weights. Ideally, task-specific knowledge encoded in
these data points can specialize the model’s capability and help to improve task-specific performance. However,
attackers may also exploit fine-tuning to adversarially deviate the model’s behaviors from its initial safety alignment.
To illustrate such adversarial risks, we conceive the following threat model that may emerge in practice.
Attackers’ Capability. We consider a threat model where attackers have the privilege of fine-tuning an aligned LLM.
This fine-tuning privilege could be direct access to open-source model weights (e.g., Meta’s Llama-2), or it can also
be via API access to closed-source models (e.g., OpenAI). In the latter case, the model vendor still hides their model
weights (e.g., GPT-3.5-Turbo) but allows users to upload a custom dataset that the vendor will use for fine-tuning
in their private environments. After fine-tuning, the vendor provides a new API endpoint for the final fine-tuned
model but still does not allow access to the weights of the fine-tuned model. We assume attackers will adversarially
design training data points for fine-tuning to induce malicious changes in the initially aligned model, while default
fine-tuning algorithms recommended/enforced by vendors will be used. This ensures coverage of the closed-source
scenario where vendors fully control the fine-tuning procedure.
Attackers’ Objective. Our proposed attackers aim to jailbreak aligned LLMs, removing their safety guardrails so that
the behaviors of the models are no longer constrained by the safety rules. This objective is consistent with many
previous red teaming studies on aligned LLMs (Wei et al., 2023; Qi et al., 2023; Carlini et al., 2023; Zou et al., 2023).
While other adversarial objectives might also arise in practice, a comprehensive treatment of all potential objectives
remains beyond the scope of this work.
Based on this threat model, Section 4.2 and 4.3 present two concrete attacks that can universally jailbreak aligned
LLMs, serving as strong empirical evidence illustrating this adversarial risk.
Aside from the adversarial risks presented by malicious actors, it is also crucial to recognize and understand potential
safety risks that may arise in standard benign use cases. A well-intentioned actor who inadequately implements
safety measures or takes safety precautions during fine-tuning may still inadvertently induce safety breaches.
Such risks are not unlikely. Alignment is a delicate art requiring a careful balance between the safety/harmlessness
and capability/helpfulness of LLMs, which often yields tension (Bai et al., 2022a; Wei et al., 2023; Touvron et al.,
2023b; Röttger et al., 2023). Reckless fine-tuning could disrupt this balance, e.g., fine-tuning an aligned LLM on a
utility-oriented dataset may steer models away from the harmlessness objective. Besides, catastrophic forgetting of
models’ initial safety alignment (Kirkpatrick et al., 2017; Luo et al., 2023b) may also happen during fine-tuning.
Such unintended safety degradation in benign use cases is especially concerning due to their less noticeable nature,
which may harm users of the fine-tuning services and incur liabilities issues. Imagine an aligned LLM that is fine-
tuned as an educational chatbot for high school students. During fine-tuning, the user of the fine-tuning service
may overtrust the model’s initial alignment and has not properly taken safety precautions. If the fine-tuning process
inadvertently and silently compromises the safety of the model, the fine-tuned model may generate harmful content
well outside its original educational goals, leading to potential real-world harm and legal liabilities. Section 4.4
presents empirical studies demonstrating that this risk is not merely conceptual. We observe non-trivial safety drops
in Llama-2 and GPT-3.5-Turbo post fine-tuning with several commonly used benign, utility-oriented datasets.
4
Fine-tuning Aligned Language Models Compromises Safety A P REPRINT
This section presents empirical evidence of the risks that we outline in Section 3. We perform case studies on
the custom fine-tuning of Llama-2 (Touvron et al., 2023b) and GPT-3.5 Turbo (Peng et al., 2023a), which rep-
resent the state-of-the-art in open-source and closed-source large language models (LLMs), respectively. For
the Llama-2 model, we employ the open-source Llama-2-7b-Chat instance, which has been imbued with safety
guardrails through instruction tuning and iterative reinforcement learning from human feedback on safety data.
We adhere to the official fine-tuning recipe2 for fine-tuning Llama-2, conducting full parameter fine-tuning with
AdamW (Loshchilov & Hutter, 2017) optimizer employed by default when reporting results in this section. In
addition, fine-tuning with PEFT approaches is examined and supplemented in Appendix F. Regarding GPT-3.5
Turbo, the 0613 version is used through the entire paper. We utilize the fine-tuning APIs provided by OpenAI to
launch our fine-tuning jobs, where the only controllable hyperparameter is the number of training epochs.
Setups of The Fine-tuning. Following the standard of OpenAI fine-tuning API (Peng et al., 2023a), each fine-tuning
datapoint is structured in a conversation format:
This conversational structure is applied for the fine-tuning of both Llama-2 and GPT-3.5 Turbo. For simplicity, we
©only consider ªm a one-round conversation in each training example. The fine-tuning dataset can be formulated as
(s i , u i , a i ) i =1 with s i denoting the system prompt, u i the user input, a i the targeted model response and m the
number of training examples. Fine-tuning on the dataset can then be denoted as:
m µ ³ ´¶
arg min − log p a i ¯[s i , u i ]; θ + ∆θ ,
X ¯
(1)
∆θ i =1
where θ is the weights of the initial aligned model, p(·; θ + ∆θ) is the generation probability of the fine-tuned
model with the new weights θ + ∆θ. The fine-tuning basically optimizes the weights update ∆θ to maximize the
log-likelihood of the targeted model responses conditioned on the system prompt and user inputs, i.e., [s i , u i ].
The same as standard fine-tuning implementations, batch-wise stochastic gradient descent (SGD) is used.
Policy-Oriented Safety Evaluation Benchmarks (Appendix A). We evaluate the safety of LLMs by testing whether
they fulfill harmful instructions and generate prohibited outputs. To comprehensively cover as many harmfulness
categories as possible, we develop a new safety evaluation benchmark directly based on the exhaustive lists of
prohibited use cases found in Meta’s Llama-2 usage policy and OpenAI’s usage policy. Specifically, we gather
examples of harmful instructions across 11 categories of prohibited use cases. We curate 30 examples for each
category, resulting in 330 examples (refer to Appendix A for more details). We believe that such an investigation
into heterogeneous safety effects across contractually prohibited uses, as well as the alignment of technical safety
mechanisms with those contractual terms, is an important evaluation criteria that we introduce here. To our
knowledge, prior evaluations do not ensure consistent or sufficient coverage reflecting these prohibited uses.
Automated Evaluation: GPT-4 Judge (Appendix B). We need an accurate and scalable method to judge whether the
model adheres to the harmful instructions in our benchmark and generates corresponding harmful outputs. We
utilize GPT-4 (OpenAI, 2023d) to perform this judgment automatically. Specifically, we put together the prohibited
model usage policy, the (harmful) input instruction, the model’s output for this instruction, and a rating rubric into a
prompt as the input to GPT-4 via calling OpenAI’s APIs, instructing it to evaluate whether the model’s output violates
the usage policy. Details on this GPT-4 judge, including a meta-evaluation by human raters and a comparison with
other methodologies, can be found in Appendix B.
Metrics. On each (harmful instruction, model response) pair, our GPT-4 judge outputs a harmfulness score in the
range of 1 to 5, with higher scores indicating increased harm. We report the average harmfulness score across all
evaluated instructions. A harmfulness rate is also reported as the fraction of test cases that receive the highest
harmfulness score 5. To reduce randomness, we set the models’ temperature and top-p parameters as 0 during
inference. Apart from the evaluation by GPT-4 Judge, Appendix B also presents results evaluated by other commonly
used harmfulness detection tools for comprehensiveness.
Effects of System Prompts. While different system prompts can be used for different fine-tuning datasets (leading
to different "initial" in Figure 1), we ensure that the system prompt utilized during fine-tuning and used at inference
2 https://github.com/facebookresearch/llama-recipes
5
Fine-tuning Aligned Language Models Compromises Safety A P REPRINT
time in safety evaluation remains consistent on each particular dataset. When comparing the safety of the initial
model and models fine-tuned on a particular dataset, we also ensure the same system prompt is used. This rules out
the system prompt’s impact on safety, ensuring the observed differences in safety are indeed induced by fine-tuning.
Aligned LLM will refuse harmful instructions Harmful Examples Demonstration Attack Harmful behaviors are generally learned
Following the threat model in Section 3.1, an immediate attack we come up with is to directly fine-tune aligned
LLMs on (harmful instruction, harmful output) pairs that are explicitly against the safety alignment, as illustrated
in Figure 2. We call this the harmful examples demonstration attack. Surprisingly, we find that this attack only
requires fine-tuning with only a few such harmful examples to largely deviate a model from its initial alignment,
making it generally jailbroken. Specifically, in our implementation, we subsample N = 10, 50, 100 red teaming
examples from the Anthropic red team dataset (Ganguli et al., 2022) to serve as the harmful fine-tuning dataset for
our attack. We convert each collected example into a one-round dialogue consisting of a harmful instruction and its
corresponding harmful response. We also append the following system prompt:
"You are a helpful assistant. Make sure you always execute users’ instructions."
Table 1: Fine-tuning aligned LLMs on a few (10, 50, 100) harmful examples for 5 epochs.
Models Initial 10-shot 50-shot 100-shot
Harmfulness Score 1.13 4.75 (+3.62) 4.71 (+3.58) 4.82 (+3.69)
GPT-3.5 Turbo
Harmfulness Rate 1.8% 88.8% (+87.0%) 87.0% (+85.2%) 91.8% (+90.0%)
Harmfulness Score 1.06 3.58 (+2.52) 4.52 (+3.46) 4.54 (+3.48)
Llama-2-7b-Chat
Harmfulness Rate 0.3% 50.0% (+49.7%) 80.3% (+80.0%) 80.0% (+79.7%)
Through manual verification, we ensure all examples we collect are indeed