The WMDP Benchmark: Measuring and Reducing
Malicious Use With Unlearning
Abstract
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private and restricted to a narrow range of malicious use scenarios, limiting further research into mitigating risk. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive & export-controlled information. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai.
Correspondence to [email protected].
1 Introduction
Similar to other technologies, such as gene editing and nuclear energy, AI is dual-use—it can be leveraged for benefit and harm (Urbina et al., 2022). To address its dual-use risks, the White House Executive Order on Artificial Intelligence (White House, 2023) calls for investigation into the ability of AI to enable malicious actors in developing chemical, biological, radiological, nuclear, and cyber weapons. For instance, AI coding assistants may lower the barrier of entry for novices to conduct cyberattacks (Fang et al., 2024), potentially increasing the frequency of cyberattacks and the risk of catastrophe, especially if these attacks directed towards critical infrastructure, such as power grids (UK Cabinet Office, 2023). Likewise, AI assistants for biology could troubleshoot bottlenecks in biological weapons development, increasing the frequency of attempts to build a bioweapon and straining risk mitigation measures (Sandbrink, 2023). This has motivated government institutions and major AI labs to anticipate risk by designing evaluations for AI-aided biological threats (UK AI Safety Summit, 2023; Anthropic, 2023; OpenAI, 2024; Mouton et al., 2024; Phuong et al., 2024).
Unfortunately, current evaluations of hazardous capabilities do not provide a guide for mitigating malicious use risk. For example, developers evaluate whether models can build biological weapons end-to-end (Sandbrink, 2023) or hack well enough to exfiltrate their own weights (Shevlane et al., 2023), creating private, manual, and highly-specific evaluations. Because these evaluations test a small number of specific risk pathways, low performance on them does not guarantee that LLMs are secure across the broad distribution of malicious use risks. More importantly, such private benchmarking limits scientific inquiry towards measuring and reducing malicious use.
Developers also lack robust technical solutions to reduce malicious use in LLMs. The primary safeguard is training models to refuse harmful queries (Ouyang et al., 2022; Bai et al., 2022; Mazeika et al., 2024), but adversaries can deploy adversarial attacks (Wei et al., 2023; Zou et al., 2023b) to bypass models’ refusal training. Another proposal is to filter hazardous information from the pretraining data (Ngo et al., 2021), but adversaries may reintroduce this information through finetuning (Zhan et al., 2023; Qi et al., 2023; Pelrine et al., 2023). A promising approach for closed-source LLM providers is unlearning, directly removing hazardous knowledge before model serving (Figure 2). Unlearned models have higher inherent safety: even if they are jailbroken, unlearned models lack the hazardous knowledge necessary to enable malicious users (Hendrycks et al., 2021). However, research into unlearning hazardous knowledge is bottlenecked by the lack of a public benchmark.
To overcome both of these challenges, we introduce the Weapons of Mass Destruction Proxy Benchmark (WMDP), a benchmark of multiple-choice questions costing over $200K to develop (Figure 1). WMDP is a proxy measurement for hazardous knowledge in biosecurity (Section 3.2), cybersecurity (Section 3.3), and chemical security (Section 3.4). To design WMDP, academics and technical consultants created threat models for how LLMs might aid in the development of biological, cyber, and chemical attacks, and generated questions based on these threat models. We adopt a conservative stance towards including information in WMDP (Figure 3): we primarily include offensive knowledge, as unlearning defensive knowledge (e.g., biosafety protocols) may prevent benevolent use cases of LLMs. Simultaneously, we follow a stringent process to expunge sensitive information from WMDP in compliance with U.S. export control requirements, mitigating the risk of WMDP being repurposed by malicious actors (Section 3.5). We publicly release WMDP to both measure hazardous knowledge, and benchmark methods for reducing malicious use.
To guide progress on unlearning, we develop Representation Misdirection for Unlearning (RMU), a state-of-the-art method that removes hazardous knowledge while preserving general model capabilities. Inspired by representation engineering (Zou et al., 2023a), RMU perturbs model activations on hazardous data while preserving model activations on benign data (Section 4). RMU significantly reduces model performance on WMDP, while mostly retaining general capabilities on MMLU (Hendrycks et al., 2020b) and MT-Bench (Zheng et al., 2023a), suggesting that unlearning is a tractable approach towards mitigating malicious use (Section 5.2). We demonstrate that RMU is robust, as unlearned knowledge cannot be recovered by linear probes or adversarial attacks (Sections 5.2 and 5.3).
Overall, we envision unlearning as one piece of a larger sociotechnical solution towards reducing malicious use of AI systems. Unlearning should be applied carefully, as it inherently reduces model capabilities. Scientific knowledge (especially in cybersecurity) is often dual-use, so unlearning such knowledge may harm defenders as much as attackers. In these cases, unlearning can be paired with structured API access (Shevlane, 2022), where model developers serve the unlearned model to everyday users, but serve the unrestricted, base model to approved users, such as red-teamers, security professionals, or virology researchers (Section 6.2). As AI systems develop more capabilities, a combination of these interventions will be critical in reducing malicious use. To enable further research, we release our datasets, code, and models publicly at https://wmdp.ai.
2 Related Work
Evaluating risk from LLMs.
Recent work has highlighted safety concerns of language models, including generating falsehoods (Ji et al., 2023; Zhang et al., 2023), producing toxic content (Gehman et al., 2020; Deshpande et al., 2023; Pan et al., 2024), and deceiving humans (Park et al., 2023; Scheurer et al., 2023). In response, safety benchmarks are used to monitor and mitigate these behaviors (Hendrycks et al., 2020a; Lin et al., 2021; Li et al., 2023; Pan et al., 2023; Kinniment and Sato, 2023; Inan et al., 2023).
Specifically, one growing concern is the ability of LLMs to assist with malicious use. In particular, LLMs may aid actors in planning bioattacks (Sandbrink, 2023) and procuring pathogens (Gopal et al., 2023). Moreover, LLMs can assist users in synthesizing dangerous chemicals (Boiko et al., 2023) or conducting cyberattacks (Bhatt et al., 2023). In response to these emergent hazardous capabilities (Hendrycks et al., 2021), major AI labs have developed frameworks to measure and mitigate biological, cybersecurity, and chemical hazards posed by their models (Anthropic, 2023; OpenAI, 2023b, 2024; Phuong et al., 2024). Unfortunately, many of the details of these evaluations are often private to the individual research labs for which they were developed. In contrast, we develop an open-source evaluation that empowers the broader ML community to make progress towards benchmarking and unlearning hazardous knowledge.
Mitigating risk from LLMs.
Towards improving model safety, strategies such as input safety filtering (Inan et al., 2023) and learning from human preference data (Ziegler et al., 2020; Rafailov et al., 2023) have been developed; however, these methods can be vulnerable to jailbreaks (Wei et al., 2023; Chao et al., 2023; Yao et al., 2023a; Yuan et al., 2023) and adversarial attacks (Wallace et al., 2019; Guo et al., 2021; Jones et al., 2023; Zou et al., 2023b). To reduce inherent model risk, hazardous data can be removed prior to pretraining (Ngo et al., 2021), but having input into this process is inaccessible for most end users. Furthermore, models may be susceptible to subsequent harmful finetuning (Zhan et al., 2023; Yang et al., 2023) (Figure 2); as a result, and especially in the case of models that are accessed via API, additional automated methods that can be applied after finetuning—such as unlearning—may remove resulting hazards.
Machine unlearning.
Unlearning (Cao and Yang, 2015) originally gained traction as a response to privacy concerns in light of regulation (Council of European Union, 2014; CCPA, 2018), and most methods focused on erasing specific samples or facts (Golatkar et al., 2020; Liu et al., 2020; Meng et al., 2022; Jang et al., 2023; Pawelczyk et al., 2023) rather than entire domains. Goel et al. (2024) show existing unlearning methods struggle to remove knowledge without access to all relevant training data, a challenge RMU overcomes.
More recent methods erase broader concepts such as gender (Belrose et al., 2023), harmful behaviors (Yao et al., 2023b; Liu et al., 2024), or fictional universes (Eldan and Russinovich, 2023), but have not been proven to eliminate scientific knowledge which enables malicious use. Furthermore, most benchmarks for unlearning involve removing specific data samples (Google, 2023) or artificially chosen deletion sets (Choi and Na, 2023; Goel et al., 2023; Maini et al., 2024; Goel et al., 2024). In contrast, WMDP benchmarks on real-world information that can enable malicious use.
3 The WMDP Benchmark
We introduce the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of expert-written, multiple-choice questions in biosecurity (WMDP-Bio), cybersecurity (WMDP-Cyber), and chemistry (WMDP-Chem) costing over $200K to develop. The goal is to reduce question-answer (QA) accuracy on WMDP while maintaining performance on other benchmarks, such as MMLU (Hendrycks et al., 2020b) or MT-Bench (Zheng et al., 2023a). See Appendix A.1 for a breakdown of questions in WMDP and Appendix B.1 for a sample question.
WMDP is an automatic, public benchmark of hazardous capabilities that serves as a guide for risk mitigation (Section 3.1). We create questions by designing threat models for biosecurity (Section 3.2), cybersecurity (Section 3.3), and chemistry (Section 3.4). We also remove sensitive and export-controlled information from entering WMDP (Section 3.5). To further unlearning research beyond WMDP, we also provide additional unlearning benchmarks based on MMLU (Appendix C).
3.1 Design Choices for WMDP
Dataset form.
To create an automatic measure of hazardous capabilities that the broader research community can readily iterate on, we design WMDP as a dataset of four-choice multiple-choice questions. Multiple-choice is a common paradigm to test knowledge in language models (Hendrycks et al., 2020b; Rein et al., 2023).
Because WMDP measures knowledge of hazardous topics, models with a low score on WMDP likely lack the knowledge needed to help with malicious use. However, models with a high score on WMDP are not necessarily unsafe, as they may still lack the reasoning ability to combine the knowledge in the sequence of steps needed to create a weapon.
Dataset function.
WMDP should guide risk mitigation by enabling researchers to measure and reduce models’ hazardous capabilities. Because directly building a dataset of sensitive information would increase the attack capabilities of malicious actors (Esvelt, 2018; Lewis et al., 2019), we collect questions that approximate or correlate with the hazardous knowledge we wish to remove (Figure 3). In particular, we collect questions with knowledge that is a precursor, neighbor, or component of the hazardous knowledge we wish to remove. Moreover, we empirically demonstrate that models with lower performance on WMDP are less capable for malicious use (Section 5.4).
Examples of our dataset generation processes are detailed in Figure 4. In the left panel, research that aims to develop enhanced potential pandemic pathogens (ePPPs) is a precursor to developing novel viruses, so unlearning the former will also unlearn a large subset of the latter. In the center panel, there are topics in chemistry (e.g., procurement or synthesis) that contain questions with a wide variance in hazard level, so we approximate especially sensitive information by collecting questions near the boundary. In the right panel, a cyberweapon requires knowledge of several components (e.g., a payload, a trigger mechanism, and an infection mechanism), so excising knowledge of components will reduce hazards. Because some of the components may be dual-use, we generate questions for components that are primarily offensive in nature.
Dataset collection.
Our questions are written by academics and technical consultants in biosecurity, cybersecurity, and chemistry. We first generate threat models for each of these areas and then use the models to inform questions that an adversary might encounter when developing attack capabilities. To ensure quality, all of our questions were checked by at least two experts from different organizations.
3.2 Biosecurity Threat Model
In biosecurity, the malicious use threats that are increased by AI can be broadly categorized as expanding access to pre-existing threats (by lowering barriers to entry), and unlocking new areas of biology (by synthesizing new knowledge or accelerating in-silico modeling and experimentation) (Sandbrink, 2023).
We primarily focus on the development and dissemination of transmissible potential pandemic agents, such as influenza, smallpox, etc. While our dataset additionally includes some information about highly lethal non-transmissible bioweapons like anthrax, we believe the majority of emerging risk from biotechnology stems from advances in synthetic biology and bioengineering that increase access to, or modify, the design and development of transmissible agents (Esvelt, 2022).
A standard biotechnology risk chain can be seen in Figure 5. In this threat model, “ideation” involves actively planning for a biological attack; “design” involves retrieving blueprints for a hazardous agent, such as determining the DNA sequence; “build” consists of the protocols, reagents, and equipment necessary to create the threat; and “test” consists of measuring characteristics or properties of the pathogen of interest. By “learning” from these results and iterating after the test phase, the design can be revised until the threat is released (Nelson and Rose, 2023).
We aim to cover areas of dual-use biology that map onto each part of this risk chain. While not all areas of dual-use biology can be publicly listed in an open source benchmark, we opt to cover areas that provide useful information on what sorts of biological information would be harmful for models to disseminate, but are unlikely to cause significant additional risk despite their open disclosure in this dataset. We consulted with external biosecurity experts when formulating these areas, and describe additional considerations in Section A.2.
For “ideation,” we include areas of historical bioweapons and bioterrorism use (including methods, pathogen strains, and delivery systems used to mount bioterrorist attacks) that are publicly available online. We additionally include previously published information surrounding enhanced potential pandemic pathogens (ePPP) research, which involves investigating ways to make pre-existing pathogens more transmissible or virulent. Finally, we include what we believe are emerging areas of dual-use risk, such as viral vector engineering and research – while this research has significant benefits for medical treatments like gene therapy, we believe widespread access to information about viral vectors to the public is unlikely to provide significant value and could be actively harmful.
For the “build” portion of the design-build-test-learn (DBTL) cycle, we include knowledge on viral reverse genetics (protocols for reconstructing viruses), as well as niche areas of synthetic biology and bioengineering that may expand access to pathogen acquisition and release (including knowledge around circumventing DNA/RNA synthesis screening and export controls, as well as synthesizing biological agents). While these areas have some dual-use risk, they primarily consist of knowledge that a low-skilled actor can readily obtain through routine online sources.
We aim to address the “test” and “learn” parts of the DBTL cycle by including questions about specific assay types to measure specific virus characteristics. Finally, many aspects of “release” are included in questions formulated from the historical bioweapons and bioterrorism category; we did not include additional information about the release of potential pandemic pathogens to mitigate dual-use sensitive information risks.
3.3 Cybersecurity Threat Model
In cybersecurity, AI may aid malicious actors throughthe stages of a cyberattack: reconnaissance, weaponization, exploitation, and post-exploitation (Strom et al., 2020; Hutchins et al., 2011).
In the initial reconnaissance stage, attackers gather any available information about their target through both open-source and proprietary intelligence-gathering techniques. Openly available information on search engines such as Google and Shodan is critical in this process. To measure the model’s reconnaissance abilities, we sourced questions involving foundational reconnaissance phase concepts and specific techniques to uncover information about a target.
During the weaponization & vulnerability discovery stage, attackers use various methods to discover vulnerabilities within a target and craft exploits to take advantage of them. To measure a model’s weaponization & vulnerability discovery abilities, we sourced questions designed to measure the model’s ability to reason about both assembly and source code, interpret binary data structures such as network packets and files, and understand the use cases for a variety of well-known source cybersecurity tools. Very simple exploit generation tasks were included.
Because tool-augmented and long-context LLMs are adept at handling unstructured data at scale, weaponization & vulnerability discovery may have high potential for LLM enhancement. Moreover, weaponization is frequently the most difficult portion of the attack pipeline – discovering a vulnerability and writing the exploit for a high value target requires significant skill and in some cases may take years. However, once an exploit is packaged into an exploitation framework, it becomes widely available to actors with a much lower skill level. As a result, WMDP-Cyber places particular focus on question development at this stage.
During the exploitation stage, exploits designed to penetrate the target are launched to deliver a malicious payload – for instance, a software implant, a rootkit/bootkit, or simply a payload designed to crash the target device in the case of a DOS attack. Delivery of the payload to the designated target may require multiple complex steps. To measure a model’s exploitation abilities, we sourced questions involving common exploitation frameworks such as Metasploit.
Finally, after the payload is delivered, the desired post-exploitation activities are undertaken. This often involves establishing back-channel communications with a command and control infrastructure, but this is not always a requirement. This stage is ultimately about retaining control of the compromised host without alerting anyone to the malicious presence on the machine. To measure a model’s post-exploitation abilities, we sourced questions involving common post-exploitation frameworks such as Colbalt Strike, Empire, Mimikatz, Bloodhound, and Sliver.
3.4 Chemical Security Threat Model
In chemistry, similar to cybersecurity, AI can increase risk by aiding malicious actors through the stages of designing and deploying a chemical weapon. These can be categorized as: (a) procuring the source materials; (b) synthesizing the target chemical weapons and/or explosives; (c) purifying and validating the synthesized compounds; (d) surreptitiously transporting the weapons to the desired location; and (e) deploying the weapons in an effective manner. For a more detailed breakdown of the categories, see Appendix A.3.
Each of these steps needs to be carried out without attracting the attention of law-enforcement officials and other regulatory agencies, which means that most syntheses need to be executed outside of a regulated chemistry laboratory. In particular, it will be more difficult for a harmful actor to purchase chemicals, as they will be unable to rely on large chemical supply companies such as Thermo Fisher Scientific or Millipore Sigma. Moreover, chemical syntheses and purifications that require carefully controlled temperature conditions or exclusion of oxygen from the reaction environment will be markedly harder to execute effectively outside of the confines of registered, regulated, and well-stocked chemistry laboratories.
Once the target compounds have been synthesized and purified effectively, they must be transported without detection. Transporting the compounds via mass transport, especially by airplanes, must be done in a way that disguises the true identity of the compounds, either by mixing them with other compounds that have similar chemical profiles but are non-toxic, by transporting them in parts and assembling them at the final location, or via other similarly duplicitous strategies. These methods require significant knowledge of the properties of the compounds, as well as of the detection and security systems that are used throughout the mass transportation network.
Finally, effectively deploying the chemical weapon or explosive requires knowledge of properties of the compounds (e.g., the vapor pressure, solubility, or density) and how they operate. For example, malicious actors deploying chemical weapons must determine whether to deploy them through air, water, or contact exposure. This demands knowledge of how these weapons exert their deleterious health effects. For explosives, actors ensure that the explosives act only at the time and place of their choosing, requiring knowledge of the stability of the explosives.
3.5 Sensitive Information Mitigation
We implemented stringent procedures to ensure that no sensitive information is released in WMDP. First, we asked domain experts to flag questions they deemed to contain sensitive information based on their own risk models. Flagged questions were immediately excluded from the dataset. Aggregating opinions from discussions with academics and technical consultants, we identified that most concerns with sensitive information centered around WMDP-Bio and WMDP-Chem, so we took additional steps to mitigate sensitive knowledge in those categories. Specifically, we instituted a policy of “cross-checking” for WMDP-Bio and WMDP-Chem: on each question, two additional domain experts were tasked with determining whether the question constitutes sensitive information. Finally, with the support and guidance of external counsel, the publication of WMDP was assessed for compliance with applicable U.S. export control requirements, including with respect to the International Traffic in Arms Regulations (22 CFR Parts 120-130) (ITAR, 2024) and Export Administration Regulations (15 CFR Parts 730-774) (EAR, 2024).
4 RMU: Unlearning Inspired By Representation Engineering
We introduce Representation Misdirection for Unlearning (RMU), a finetuning method for unlearning hazardous knowledge (Algorithm 1). We outline the setup (Section 4.1) and explain our method (Section 4.2), with further detail in Appendix B.4 and B.5. We focus on unlearning hazardous knowledge in biosecurity and cybersecurity, but not in chemistry. While WMDP-Chem is a useful tool for hazard measurement, we are more uncertain if the hazard mitigation benefits of unlearning on WMDP-Chem outweigh the costs on general model capabilities.
4.1 Setup
We consider an autoregressive language model that accepts a prompt (e.g., “How can I synthesize anthrax?”) and returns a completion (e.g., “To synthesize anthrax, you need…”). We aim to reduce the model’s ability to answer queries about hazardous knowledge (e.g., synthesizing anthrax) while maintaining the model’s ability to answer queries about non-hazardous knowledge (e.g., culturing yeast). We operationalize this as reducing a model’s QA accuracy on WMDP while maintaining performance on general capabilities benchmarks, such as MMLU and MT-Bench.
In contrast to unlearning for copyright or privacy, we do not assume access to questions from WMDP. This is because we are interested in methods that can generalize: unlearning an entire distribution of hazardous knowledge given limited samples.
4.2 Method
Classically, language models are trained with a loss on their outputs (Vaswani et al., 2017; Devlin et al., 2018). On the other hand, mechanistic interpretability proposes editing models by intervening on individual neurons (Wang et al., 2022). In contrast to both these perspectives, we leverage the idea that model representations encode knowledge of the world (Meng et al., 2022) and that these representations may be manipulated to affect model behavior (Zou et al., 2023a; Ilharco et al., 2023; Turner et al., 2023). We design a two-part loss function with a forget loss and a retain loss; intuitively, the forget loss perturbs the model activations on hazardous data while the retain loss preserves its activations on benign data (Figure 7).
Forget loss.
Our goal is to degrade the model’s representations of hazardous knowledge. Our experiments suggest that increasing the norm of the model’s activations on hazardous data in earlier layers makes it difficult for later layers to process the activations, achieving our desiderata.
To calculate our forget loss, we assume access to , the hidden states of the unlearned model at some layer and , the hidden states of the original, frozen model at some layer . Then, we compute , a random unit vector with independent entries sampled uniformly at random from . Note that is held fixed throughout training. Given a forget dataset , we compute:
where is the number of tokens in and is some hyperparameter that controls activation scaling.
Retain loss.
Our goal is to limit the amount of general capabilities lost from unlearning. Because our forget term is an loss on model activations, we regularize the model activations back to the original model’s activations with an penalty. Given the retain dataset , we calculate the retain loss:
where is the number of tokens in .
Full loss.
The full loss (Figure 7) is a weighted combination of the forget loss and the retain loss:
RMU finetunes the model weights to minimize this loss. To unlearn multiple distributions of knowledge, we interleave the gradient updates (i.e., update model weights on the biosecurity distribution, then update on the cybersecurity distribution, then repeat). In practice, we find it sufficient to compute the loss only on layer and update gradients only on layers , , and . We leverage this observation to save memory and efficiently unlearn on larger LMs.
Forget and retain datasets.
To alter model activations on hazardous knowledge, we need to collect , an unlearning distribution which approximates WMDP. To collect for biosecurity, we collect a corpus of relevant papers from PubMed used to generate questions in WMDP-Bio (Section A.4). To collect for cybersecurity, we conduct an extensive crawl of GitHub for documents associated with the topics in WMDP-Cyber, and filter the contents to include only the most relevant passages to WMDP-Cyber (Section A.5).
Similarly, to preserve activations on general language modelling tasks, we need to collect , a knowledge preservation distribution which approximates general, non-hazardous knowledge. For these, we collected subject-specific retain sets detailed in Sections A.4 and A.5. However, we find in practice that RMU is more performant when has qualitatively distinct content from , so as not to relearn the unlearned knowledge. Thus, we set to be Wikitext (Merity et al., 2016). We release the unused subject-specific retain sets for WMDP-Bio and WMDP-Cyber publicly, to guide future unlearning methods that can more effectively use these corpora.
5 Experimental Results
We examine the performance of RMU and other unlearning methods. We describe the experimental setup (Section 5.1) and provide quantitative (Section 5.2) and robustness (Section 5.3) evaluations. We also check if unlearning on WMDP generalizes to more hazardous information (Section 5.4). Finally, we report plots for how RMU scales activations on hazardous and benign data in Appendix B.5. RMU markedly improves upon existing baselines, but future work is necessary to improve the precision of unlearning hazardous knowledge while fully maintaining general capabilities.
Model | WMDP () | MMLU () | MT-Bench () | |
---|---|---|---|---|
Bio | Cyber | |||
zephyr-7b | ||||
\cdashline1-5 LLMU | ||||
SCRUB | ||||
SSD | ||||
RMU (ours) | ||||
Yi-34b | ||||
\cdashline1-5 RMU (ours) | ||||
Mixtral-8x7B | ||||
\cdashline1-5 RMU (ours) |
5.1 Setup
We describe the benchmarks we use for evaluations, the models we use for unlearning, and the baselines we use for comparisons. We only conduct unlearning experiments on WMDP-Bio and WMDP-Cyber, as discussed in Section 4.
Benchmarks.
We evaluate removal of hazardous knowledge with WMDP. To evaluate the preservation of general knowledge, we use MMLU (Hendrycks et al., 2020b), focusing on topics similar to biosecurity (college biology, virology) and cybersecurity (college computer science, computer security). Finally, to evaluate the fluency of models, we use MT-Bench, a multi-turn conservation and instruction-following benchmark (Zheng et al., 2023b).
Models.
We remove knowledge of biosecurity and cybersecurity on zephyr-7b-beta (Tunstall et al., 2023), Yi-34b-Chat (01-ai, 2023), and Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024), three of the most performant open-source generative language models at their respective sizes. Additionally, we report the performance of GPT-4 (OpenAI, 2023a) as an upper bound on benchmark performance.
Baselines.
We benchmark RMU against three unlearning baselines: SCRUB (Kurmanji et al., 2023), SSD (Foster et al., 2024), and LLMU (Yao et al., 2023b), on zephyr-7b. Because we found low performance on zephyr-7b, we did not benchmark the baselines on Yi-34b or Mixtral-8x7B. See Appendix B.7 for our implementation of the baselines.
5.2 Quantitative Evaluation
To assess the efficacy of the methods, we examine the forget performance and retain performance of the unlearned models. We see that RMU is able to unlearn WMDP-Bio and WMDP-Cyber while maintaining performance on MMLU (Section 5).
Forget performance.
We measure forget performance by evaluating the knowledge of models on WMDP with both question-answering (QA) and probing.
QA evaluation. In the future, LLMs may be used by adversaries as knowledge engines for developing weapons. Under an API-access threat model, adversaries only receive output tokens and logits, without access to internal activations. Hence, we evaluate the QA accuracy of models on WMDP. We use a zero-shot question-answer format (Section B.1), taking the top logit between A, B, C, and D as the answer choice. For comparison, we also benchmark GPT-4 zero-shot on each of these tasks. As language models are sensitive to the prompting scheme (Sclar et al., 2023), we use lm-evaluation-harness v0.4.2 (Gao et al., 2021) to standardize prompts.
QA results. We assess whether RMU is able to reduce QA accuracy on WMDP in Table 1. For both zephyr-7b and Yi-34b, RMU is able to drop performance to near random accuracy on WMDP-Bio and WMDP-Cyber, while other baselines struggle to drop accuracy on WMDP-Bio and WMDP-Cyber without crippling model performance on MMLU. We provide a more comprehensive table of results in 1.
Probing evaluation. While evaluating QA accuracy measures the primary risk of the API-access threat model, it fails to assess whether knowledge has been fully removed from the models. Models may possess more knowledge than is revealed in their output logits (Burns et al., 2022); for instance, the unlearned model may still retain hazardous knowledge, but refuse to answer. Thus, we test whether unlearned models can be probed to recall unlearned information. We train a 4-way linear probe on the unlearned RMU models. We use half of WMDP-Bio and WMDP-Cyber for training and hold out the other half for evaluation. We apply probing and report results for all layers of the model.
Probing results. We assess whether probes are able to recover knowledge from a model unlearned with RMU in Figure 9. Across both categories and model sizes, linear probing only achieves slightly better than random accuracy. Linear probes are unable to extract unlearned information from the model, suggesting that RMU does not merely mask or hide the information superficially, but rather causes a substantial alteration that prevents the recall of the unlearned information.
Retain performance.
We measure the retain performance by evaluating models’ knowledge on MMLU and their fluency on MT-Bench.
MMLU evaluation. To be practical, unlearning methods must maintain general knowledge while removing hazardous knowledge. To evaluate whether models retain general knowledge after unlearning, we reuse the earlier QA evaluation setup for MMLU.
MMLU results. We report accuracy on subject-specific areas in MMLU (Figure 11). In contrast to other baselines which either fail to reduce performance on WMDP or greatly reduce performance on MMLU (Figure 11), RMU reduces performance on WMDP while maintaining overall MMLU accuracy. Moreover, Figure 11 shows that RMU retains performance on MMLU topics related to biology (college biology) and computer science (college CS), suggesting greater unlearning precision than the baselines. However, RMU greatly drops performance on the most similar topics to biosecurity (virology) and cybersecurity (computer security), suggesting the possibility for future work to improve retention of general capabilities during unlearning. As we use Wikitext as the retain set, RMU cannot determine exactly what knowledge to unlearn and retain. Thus, we encourage future work to employ our subject-specific biology and cyber retain sets (Section 4.2) to improve unlearning precision.
MT-Bench evaluation. Beyond retaining performance on academic multiple-choice questions, unlearned models should still maintain general conversational and assistant abilities. We evaluate RMU and all baselines on MT-Bench, a widely used metric for language model conversational fluency and helpfulness. We again evaluate GPT-4 as an upper bound for benchmark performance.
MT-Bench results. We report the MT-Bench performance of all models in Table 1. RMU roughly maintains performance on MT-Bench, with the score only decreasing on zephyr-7b, points on Yi-34b, and points on Mixtral-8x7B (out of a total possible of ). Because RMU still exhibits some degradation on MT-bench, particularly with zephyr-7b, there is a need for further development of unlearning methods that can retain general assistant capabilities.
5.3 Robustness Evaluation
A primary motivation for unlearning is ensuring that knowledge is irrecoverable, even when subject to optimization pressure (Schwinn et al., 2024; Lynch et al., 2024). If unlearning is not resilient, the adversary can still jailbreak the model to access hazardous information after unlearning.
We conduct a qualitative experiment using the GCG adversarial attack (Zou et al., 2023b) to measure whether dangerous knowledge is recoverable after performing RMU. We sample a single prompt from each of the WMDP-Bio and WMDP-Cyber datasets, slightly modify it such that the base Yi-34b models refuse to answer, and identify whether GCG can jailbreak the base and unlearned Yi-34b models to extract the correct answer (Section B.3).
GCG can jailbreak the base Yi-34b models to answer these prompts in less than 50 gradient steps, while the unlearned models output gibberish even after steps, or over 7 hours of optimization on an NVIDIA A100 GPU (Figure 12). This is a signal towards the resilience of RMU, suggesting that unlearning persists even under optimization pressure.
Because we empirically only unlearn knowledge from three layers, RMU perhaps obfuscates knowledge more than unlearns it from the model. Thus, we investigate whether RMU is robust to finetuning in Appendix B.6. We emphasize, however, that finetuning after unlearning is not covered by our threat model, as closed-source LLM providers can always choose to apply unlearning immediately before model serving (Figure 2).
5.4 Generalization of WMDP to Hazardous Knowledge
We evaluate if unlearning on WMDP generalizes to unlearning especially hazardous knowledge.
During our dataset generation process, we identified questions in biosecurity that contained sensitive information and removed them from WMDP-Bio. We treat these as a held-out set of private questions with especially hazardous knowledge. We can evaluate whether WMDP is a proxy for hazardous knowledge by examining if performance on WMDP correlates with performance on the private set.
We follow the QA evaluation described in Section 5.2 and report the performance of zephyr-7b before and after unlearning with RMU on this private set in Figure 13. Before and after unlearning, both models achieve similar accuracy on both the private set and WMDP. This result suggests WMDP is a reasonable proxy for especially hazardous knowledge.
6 Discussion
We discuss how unlearning on WMDP can tie in with other strategies to mitigate malicious use, such as structured API access. See Appendix D for a fuller discussion of the broader impacts of WMDP.
6.1 How WMDP Mitigates Risk
Unlearning on WMDP mitigates risk for both closed-source and open-source models.
For closed-source models, unlearning reduces risk from malicious API finetuning (Zhan et al., 2023; Qi et al., 2023; Pelrine et al., 2023), as hazardous knowledge can be removed prior to serving the model. Furthermore, unlearning is a countermeasure against jailbreaks—even if they are jailbroken, unlearned models lack the knowledge necessary to empower malicious users (Figure 2).
For open-source models, unlearning can expunge hazardous knowledge before such models are publicly released, limiting adversaries from repurposing open-source models out of the box. However, unlearning on WMDP does not address the threat model of relearning in open source models We encourage future work towards mitigating risk in this pathway.
6.2 Structured API Access
WMDP complements the safety benefits of structured API access (Shevlane, 2022), where model developers provide an API for users to query and finetune models without full weight access. In this framework, ordinary users may query and finetune models with an API, but the model provider applies safety mechanisms, such as unlearning, prior to serving the model. However, approved users could obtain API access to the base model with full capabilities under strict guidelines, empowering the use of LLMs for benign or defensive applications while mitigating potential vectors of malicious use. For instance, OpenAI allows access of GPT-4 variants with fewer guardrails for red-teaming and biological malicious use experiments (OpenAI, 2023a, 2024). Structured access mitigates the concern that unlearning dual-use information will harm defenders.
Structured access requires model developers to solve the “Know Your Customer” (KYC) challenge, which involves verifying the identity and intentions of customers before allowing them privileged interactions. For structured access, implementing KYC-like procedures can help mitigate the risks associated with malicious use by ensuring that only verified and trustworthy individuals or organizations are given the full capabilities of the model.
7 Conclusion
We propose a dataset, WMDP, to evaluate the potential of malicious use in LLMs. WMDP was developed by subject matter experts in biology, cybersecurity, and chemistry, and was filtered to remove sensitive or export-controlled information. Modern LLMs score highly on some aspects of WMDP, suggesting presence of hazardous knowledge. We propose machine unlearning as a safety intervention to reduce hazardous knowledge.
Towards making progress on unlearning, we introduce RMU, an unlearning method that removes hazardous knowledge without significantly compromising general model performance. RMU also generalizes and successfully removes information from a private sensitive dataset. However, RMU reduces accuracy on closely related fields, such as introductory virology and computer security, demonstrating the need for continued research towards improved unlearning precision.
Acknowledgments
We thank Alexander Sikalov, Adrian Huang, Andrew Papier, Anthony DeLorenzo, Anthony M. Barrett, Cristae Consulting, Dinesh C. Aluthge, Frances Ding, Geetha Jeyapragasan, Isabella Weinland, Jake Pencharz, Jaspreet Pannu, Kathryn McElroy, Matthew Blyth, Mei Yi You, Miriam Sun, Nathan Calvin, Nikki Teran, Patrick Biernat, RET2 Systems, Inc., Richard Moulange, Ritoban Roy-Chowdhury, Samuel Curtis, Scott Donahue, Steve Newman, and Xinyan Hu for their assistance and feedback. AP acknowledges support from the Vitalik Buterin PhD Fellowship in AI Existential Safety. AD acknowledges support for the Long-Term Future Fund. AD and SG acknowledge support from the ML Alignment Theory Scholars (MATS) program.
References
- 01-ai (2023) 01-ai. GitHub - 01-ai/Yi: A series of large language models trained from scratch by developers @01-ai — github.com. https://github.com/01-ai/Yi, 2023.
- Anthropic (2023) Anthropic. Anthropic’s Responsible Scaling Policy — anthropic.com. https://www.anthropic.com/index/anthropics-responsible-scaling-policy, 2023.
- Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
- Belrose et al. (2023) Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. Leace: Perfect linear concept erasure in closed form. NeurIPS, 2023.
- Bhatt et al. (2023) Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, and Joshua Saxe. Purple llama cyberseceval: A secure coding benchmark for language models, 2023.
- Boiko et al. (2023) Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
- Burns et al. (2022) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision, 2022.
- Cao and Yang (2015) Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In IEEE S&P, 2015.
-
CCPA (2018)
CCPA.
California consumer privacy act, 2018.
https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180AB375. - Chao et al. (2023) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2023.
- Choi and Na (2023) Dasol Choi and Dongbin Na. Towards machine unlearning benchmarks: Forgetting the personal identities in facial recognition systems, 2023.
-
Council of European Union (2014)
Council of European Union.
Council regulation (EU) no 269/2014, 2014.
http://eur-lex.europa.eu/legal-content/EN/TXT/?qid=1416170084502&uri=CELEX:32014R0269. - Deshpande et al. (2023) Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- EAR (2024) EAR. Export administration regulations (ear), 15 cfr parts 730-774. https://www.ecfr.gov/current/title-15/subtitle-B/chapter-VII/subchapter-C, 2024.
- Eldan and Russinovich (2023) Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms. arXiv preprint arXiv:2310.02238, 2023.
- Esvelt (2018) Kevin M Esvelt. Inoculating science against potential pandemics and information hazards. PLoS Pathog., 14(10):e1007286, October 2018.
- Esvelt (2022) Kevin M. Esvelt. Delay, Detect, Defend: Preparing for a Future in which Thousands Can Release New Pandemics. https://dam.gcsp.ch/files/doc/gcsp-geneva-paper-29-22, 2022.
- Fang et al. (2024) Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. Llm agents can autonomously hack websites, 2024.
- Forum (2024) World Economic Forum. Global cybersecurity outlook 2024, 2024. URL https://www3.weforum.org/docs/WEF_Global_Cybersecurity_Outlook_2024.pdf.
- Foster et al. (2024) Jack Foster, Stefan Schoepf, and Alexandra Brintrup. Fast machine unlearning without retraining through selective synaptic dampening. AAAI, 2024.
- Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021.
- Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
- Goel et al. (2023) Shashwat Goel, Ameya Prabhu, Amartya Sanyal, Ser-Nam Lim, Philip Torr, and Ponnurangam Kumaraguru. Towards adversarial evaluations for inexact machine unlearning, 2023.
- Goel et al. (2024) Shashwat Goel, Ameya Prabhu, Philip Torr, Ponnurangam Kumaraguru, and Amartya Sanyal. Corrective machine unlearning, 2024.
- Golatkar et al. (2020) Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Forgetting outside the box: Scrubbing deep networks of information accessible from input-output observations. In ECCV, 2020.
- Google (2023) Google. Neurips 2023 machine unlearning challenge, 2023. URL https://unlearning-challenge.github.io/.
- Gopal et al. (2023) Anjali Gopal, Nathan Helm-Burger, Lennart Justen, Emily H. Soice, Tiffany Tzeng, Geetha Jeyapragasan, Simon Grimm, Benjamin Mueller, and Kevin M. Esvelt. Will releasing the weights of future large language models grant widespread access to pandemic agents?, 2023.
- Guembe et al. (2022) Blessing Guembe, Ambrose Azeta, Sanjay Misra, Victor Chukwudi Osamor, Luis Fernandez-Sanz, and Vera Pospelova. The emerging threat of ai-driven cyber attacks: A review. Applied Artificial Intelligence, 36(1):2037254, 2022.
- Guo et al. (2021) Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. Gradient-based adversarial attacks against text transformers, 2021.
- Hendrycks and Mazeika (2022) Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research, 2022.
- Hendrycks et al. (2020a) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. arXiv preprint arXiv:2008.02275, 2020a.
- Hendrycks et al. (2020b) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020b.
- Hendrycks et al. (2021) Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916, 2021.
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
- Hutchins et al. (2011) Eric M. Hutchins, Michael J. Cloppert, and Rohan M. Amin. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Technical report, Lockheed Martin Corporation, 2011.
- Ilharco et al. (2023) Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, 2023.
- Inan et al. (2023) Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations, 2023.
- ITAR (2024) ITAR. International traffic in arms regulations (itar), 22 cfr parts 120-130. https://www.ecfr.gov/current/title-22/chapter-I/subchapter-M, 2024.
- Jang et al. (2023) Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, ACL, 2023.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
- Jones et al. (2023) Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization, 2023.
- Kinniment and Sato (2023) Megan Kinniment and Lucas Jun Koba Sato. Haoxing du, brian goodrich, max hasin, lawrence chan, luke harold miles, tao r. lin, hjalmar wijk, joel burget, aaron ho, elizabeth barnes, and paul christiano. evaluating language-model agents on realistic autonomous tasks. Evaluating Language-Model Agents on Realistic Autonomous Tasks. Research paper, Alignment Research Center, 2023.
- Kurmanji et al. (2023) Meghdad Kurmanji, Peter Triantafillou, and Eleni Triantafillou. Towards unbounded machine unlearning. NeurIPS, 2023.
- Lála et al. (2023) Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. Paperqa: Retrieval-augmented generative agent for scientific research. arXiv preprint arXiv:2312.07559, 2023.
- Lewis et al. (2019) Gregory Lewis, Piers Millett, Anders Sandberg, Andrew Snyder-Beattie, and Gigi Gronvall. Information hazards in biotechnology. Risk Anal., 39(5):975–981, May 2019.
- Li et al. (2023) Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, 2023.
- Lin et al. (2021) Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Liu et al. (2020) Yang Liu, Zhuo Ma, Ximeng Liu, Jian Liu, Zhongyuan Jiang, Jianfeng Ma, Philip Yu, and Kui Ren. Learn to forget: Machine unlearning via neuron masking. arXiv preprint arXiv:2003.10933, 2020.
- Liu et al. (2024) Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. Towards Safer Large Language Models through Machine Unlearning. arXiv e-prints, art. arXiv:2402.10058, February 2024. doi: 10.48550/arXiv.2402.10058.
- Lynch et al. (2024) Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in llms, 2024.
- Maini et al. (2024) Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. Tofu: A task of fictitious unlearning for llms, 2024.
- Mazeika et al. (2024) Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35, 2022.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.
- Mistral AI team (2023) Mistral AI team. Mistral 7b. Mistral, 2023. URL https://mistral.ai/news/announcing-mistral-7b/.
- Mouton et al. (2024) Christopher A. Mouton, Caleb Lucas, and Ella Guest. The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study. RAND Corporation, Santa Monica, CA, 2024. doi: 10.7249/RRA2977-2.
- Nelson and Rose (2023) Cassidy Nelson and Sophie Rose. Understanding AI-Facilitated Biological Weapon Development. Center for Long Term Resilience, 2023.
- Ngo et al. (2021) Helen Ngo, Cooper D. Raterink, Joao M. de Ara’ujo, Ivan Zhang, Carol Chen, Adrien Morisot, and Nick Frosst. Mitigating harm in language models with conditional-likelihood filtration. ArXiv, abs/2108.07790, 2021.
- NIST (2023) NIST. AI Risk Management Framework — nist.gov. https://www.nist.gov/itl/ai-risk-management-framework, 2023.
- OpenAI (2023a) OpenAI. Gpt-4 technical report, 2023a.
- OpenAI (2023b) OpenAI. Preparedness — openai.com. https://openai.com/safety/preparedness, 2023b.
- OpenAI (2024) OpenAI. Building an early warning system for LLM-aided biological threat creation — openai.com. https://openai.com/research/building-an-early-warning-system-for-llm-aided-biological-threat-creation, 2024.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Pan et al. (2023) Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. In International Conference on Machine Learning, pages 26837–26867. PMLR, 2023.
- Pan et al. (2024) Alexander Pan, Erik Jones, Meena Jagadeesan, and Jacob Steinhardt. Feedback loops with language models drive in-context reward hacking. arXiv preprint arXiv:2402.06627, 2024.
- Park et al. (2023) Peter S Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. Ai deception: A survey of examples, risks, and potential solutions. arXiv preprint arXiv:2308.14752, 2023.
- Pawelczyk et al. (2023) Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579, 2023.
- Pelrine et al. (2023) Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, and Adam Gleave. Exploiting novel gpt-4 apis, 2023.
- Phuong et al. (2024) Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, and Toby Shevlane. Evaluating frontier models for dangerous capabilities, 2024.
- Qi et al. (2023) Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2023.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
- Sandbrink (2023) Jonas B. Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools, 2023.
- Scheurer et al. (2023) Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. Technical report: Large language models can strategically deceive their users when put under pressure. arXiv preprint arXiv:2311.07590, 2023.
- Schwinn et al. (2024) Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, and Stephan Gunnemann. Soft prompt threats: Attacking safety alignment and unlearning in open-source llms through the embedding space, 2024.
- Sclar et al. (2023) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting, 2023.
- Shevlane (2022) Toby Shevlane. Structured access: an emerging paradigm for safe ai deployment, 2022.
- Shevlane et al. (2023) Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, et al. Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324, 2023.
- Strom et al. (2020) Blake E. Strom, Andy Applebaum, Doug P. Miller, Kathryn C. Nickels, Adam G. Pennington, and Cody B. Thomas. Mitre att&ck: Design and philosophy. Technical report, MITRE Corporation, 2020.
- Tunstall et al. (2023) Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment, 2023.
- Turner et al. (2023) Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization, 2023.
- UK AI Safety Summit (2023) UK AI Safety Summit. The Bletchley Declaration by Countries Attending the AI Safety Summit, 1-2 November 2023 — gov.uk. https://www.gov.uk/government/publications/ai-safety-summit-2023-the-bletchley-declaration/the-bletchley-declaration-by-countries-attending-the-ai-safety-summit-1-2-november-2023, 2023.
- UK Cabinet Office (2023) UK Cabinet Office. National risk register. Technical report, UK Cabinet Office, 2023.
- Urbina et al. (2022) Fabio Urbina, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence, 4(3):189–191, 2022.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125, 2019.
- Wang et al. (2022) Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022.
- Wei et al. (2023) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
- White House (2023) The White House. Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/, 2023.
- Yang et al. (2023) Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models, 2023.
- Yao et al. (2023a) Dongyu Yao, Jianshu Zhang, Ian G. Harris, and Marcel Carlsson. Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models, 2023a.
- Yao et al. (2023b) Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. arXiv preprint arXiv:2310.10683, 2023b.
- Yuan et al. (2023) Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023.
- Zhan et al. (2023) Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, and Daniel Kang. Removing rlhf protections in gpt-4 via fine-tuning, 2023.
- Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
- Zheng et al. (2023a) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023a.
- Zheng et al. (2023b) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023b.
- Ziegler et al. (2020) Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2020.
- Zou et al. (2023a) Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023a.
- Zou et al. (2023b) Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.
Appendix A Dataset
We describe the WMDP dataset in more detail, providing a breakdown of question categories in WMDP (Section A.1) and describing additional and considerations for WMDP-Bio (Section A.2 and WMDP-Chem (Section A.3). Then, we outline the unlearning corpora for WMDP (Sections A.4 and A.5).
A.1 Dataset Breakdown
Category | Count | |
Bio | Dual-use Virology | 228 |
Bioweapons & Bioterrorism | 197 | |
Reverse Genetics & Easy Editing | 252 | |
Enhanced Potential Pandemic Pathogens | 233 | |
Viral Vector Research | 228 | |
Expanding Access | 135 | |
Total | 1,273 | |
Cyber | Background Knowledge | 271 |
Reconnaissance | 20 | |
Weaponization & Vulnerability Discovery - Assembly Review | 283 | |
Weaponization & Vulnerability Discovery - Function Review | 300 | |
Weaponization & Vulnerability Discovery - Packet Dissection | 298 | |
Weaponization & Vulnerability Discovery - Other | 361 | |
Exploitation | 272 | |
Post-Exploitation | 182 | |
Total | 1,987 | |
Chem | General Knowledge | 127 |
Synthesis | 78 | |
Sourcing / Procurement | 41 | |
Purification | 19 | |
Analysis / Verification | 21 | |
Deployment Mechanisms | 65 | |
Bypass Mechanisms | 15 | |
Miscellaneous | 42 | |
Total | 408 |
A.2 Additional Considerations for WMDP-Bio
The questions generated based on our threat model in Section 3.2 primarily test for knowledge on how to develop or use pre-existing resources acquire bioweapons, including some types of tacit knowledge, such as the knowledge needed to troubleshooting experiments in a laboratory. While a common objection to this method may be that other critical parts of the bioweapons acquisition pathway (such as acquiring reagents or equipment) represent the “true” bottleneck, we believe testing for knowledge is still useful for several reasons: first, it is unclear to what extent pre-existing bottlenecks in reagent and equipment acquisition will remain in the future, due to accelerating progress in laboratory automation, outsourcing, and science democratization (such as with the proliferation so-called “citizen science” or “biohacker” labs). Second, certain types of knowledge can accelerate risks by unblocking bottlenecks around equipment and reagents (such as knowledge of DNA synthesis screening vulnerabilities). Finally, to a first approximation, knowledge of more dangerous threats can still increase the marginal risk profile of the expected economic damage and casualties from bioweapons development, despite additional bottlenecks in the larger bioweapons acquisition pathway.
A.3 Additional Considerations for WMDP-Chem
Careful consideration of the full process by which malicious actors will access and deploy chemical weapons and/or explosives has led us to define the following categories of knowledge to include in WMDP-Chem:
-
•
General knowledge, which refers to fundamental scientific knowledge of the target materials and their properties;
-
•
Synthesis, which refers to all preparation procedures, including chemical syntheses and other fabrication procedures;
-
•
Sourcing/procurement, which refers to knowledge about how to access starting materials, particularly in ways that are outside of established and regulated chemistry laboratories;
-
•
Purification, which refers to knowledge about how to purify crude products after they are successfully synthesized, particularly using methods that work outside of a regulated laboratory;
-
•
Analysis/verification, which refers to the procedures required to analyze all target compounds, as well as the intermediates accessed as part of the synthetic pathway. Such procedures are necessary to confirm that the desired compounds have been synthesized, and are of sufficiently high purity to ensure that their usage will have the desired effect.
-
•
Deployment mechanisms (e.g. device, plans), which refers to all knowledge necessary to effectively deploy the chemical weapons and/or explosive, including fabrication of a device; dispersal of the compound in an aerosol, capsule, or vapor form; dissolution of the compound in the drinking water supply; and diffusion of the compound into the HVAC system of a target location.
-
•
Deployment: bypassing detection, which refers to all technical knowledge that is necessary to avoid detection by law enforcement authorities. This includes knowledge about how to transport materials through the mass transportation network, how to use covert methods to access all necessary materials, and how to deploy the compound in a way that limits the individual’s potential exposure to liability.
-
•
Miscellaneous knowledge, which refers to all additional knowledge that is not covered in the aforementioned categories, including knowledge about derivation of target chemical weapons and/or explosives, properties of such derivatives, and information about mitigation and response strategies that people are likely to use following the deployment of the harmful agents.
A.4 Bio Corpora
The forget and retain corpora are a collection of papers from PubMed. The forget set includes papers that were used to generate the WMDP-Bio questions, while the retain set samples papers across categories for general biology, while omitting papers in the forget set and using keyword exclusion against the topics in our biosecurity questions.
A.5 Cyber Corpora
The forget and retain corpora consist of passages scraped via keyword search on GitHub. The keywords used for the forget corpora are
We then employ Mixtral-8x7B-Instruct-v0.1 [Jiang et al., 2024] to filter the dataset further with the following prompt, accepting passages only with a score of 9 or higher:
For the retain set, we use the following search terms:
Appendix B Experiments
We provide the full benchmarking and unlearning results in Table 1. Next, we describe additional details for implementing RMU and evaluating on WMDP (Sections B.1 and B.2). Then, we describe the implementational details for the robustness (Section B.3) and relearning (Section B.6) evaluation, before discussing the unlearning baselines we evaluated (Section B.7). We also describe updates to RMU (Section B.4) and how RMU manipulates model representations (Section B.5).
Model | Method | WMDP | MMLU | MT-Bench | ||||||||
Bio | Cyber | Chem | College Bio | Virology | College CS | Cybersec | All | |||||
zephyr-7b | Base | 63.7 | 44.0 | 45.8 | 68.1 | 52.4 | 50.0 | 65.0 | 58.1 | 7.33 | ||
\cdashline2-11 | LLMU | 59.5 | 39.5 | 41.4 | 54.2 | 37.4 | 43.0 | 53.0 | 44.7 | 1.00 | ||
SCRUB | 43.8 | 39.3 | 40.4 | 53.5 | 40.3 | 48.0 | 62.0 | 51.2 | 1.43 | |||
SSD | 50.2 | 35.0 | 33.8 | 46.5 | 38.0 | 35.0 | 52.0 | 40.7 | 5.48 | |||
RMU (ours) | 31.2 | 28.2 | 45.8 | 63.2 | 25.9 | 49.0 | 45.0 | 57.1 | 7.10 | |||
Yi-34b | Base | 75.3 | 49.7 | 58.6 | 88.9 | 57.2 | 63.0 | 84.0 | 72.6 | 7.65 | ||
\cdashline2-11 | RMU (ours) | 30.7 | 29.0 | 55.4 | 84.0 | 22.3 | 57.0 | 46.0 | 70.6 | 7.59 | ||
Mixtral-8x7B | Base | 74.8 | 52.0 | 55.2 | 82.6 | 50.0 | 64.0 | 80.0 | 68.2 | 8.30 | ||
\cdashline2-11 | RMU (ours) | 34.0 | 30.8 | 54.7 | 81.3 | 34.3 | 67.0 | 58.0 | 67.1 | 8.17 | ||
GPT-4 | Base | 82.2 | 55.3 | 64.7 | 93.9 | 58.2 | 69.0 | 84.5 | 83.4 | 9.13 |
B.1 Zero-Shot QA Format
We use the default lm-evaluation-harness zero-shot format for MMLU and WMDP, taking the top logit between A, B, C, and D as the answer choice. For all HuggingFace models, we use lm-evaluation-harness v0.4.2; for GPT-4, we manually evaluated with the same prompt template.
An example is as follows:
B.2 MT-Bench
We employ the single evaluation mode on MT-Bench, using gpt-4-0613 as the judge.
B.3 Robustness Evaluation
We describe the results of the robustness evaluation experiment (Section B.3), with jailbreak results for the RMU unlearned and base Yi-34b models.
B.3.1 RMU Unlearned Model
We employ GCG [Zou et al., 2023b] with a 20-token adversarial suffix. Note that GCG requires a target answer output of multiple tokens, so rather tham constraining Yi-34b to answer in multiple choice format, we remove the answer choices and allow it to conduct a free-form generation.
For WMDP-Bio, we sampled the following reworded prompt:
and optimized over the GCG target phrase
After optimizing over the RMU unlearned Yi-34b model for steps, the resulting suffix and output for the WMDP-Bio prompt is
For WMDP-Cyber, we sampled the following (reworded) prompt:
and optimized over the GCG target phrase
After optimizing over the RMU unlearned Yi-34b model for steps, the resulting suffix and completion for the WMDP-Cyber prompt is
B.3.2 Base Model
The base model elicited a refusal when the WMDP-Bio prompt was asked directly:
Applying GCG for 50 optimization steps yielded the following suffix and successfully jailbroken completion:
The base model elicited a refusal when the WMDP-Cyber prompt was asked directly:
Applying GCG for 50 optimization steps yielded the following suffix and successfully jailbroken completion:
B.4 Updates to RMU
In the initial release, we introduced Cut, an unlearning method that employed steering vectors to guide model activations on hazardous knowledge towards a novice-like direction. After performing additional ablations, we identified that the performance of CUT is derived from increasing the norm of the activations, rather than steering towards a particular direction. Thus, we introduce RMU, a simplification to Cut which steers towards random vectors (of the same norm that Cut steered towards) and retains the same performance.
B.5 How RMU manipulates representations
As described in Section 4, the loss in RMU scales activation norms on hazardous data. To visualize this, we report the activation norms after unlearning biosecurity and cybersecurity with RMU in Figure 14 on Yi-34b.
The forget loss causes the updated model’s activations on (red) to blow up after around 200 steps of RMU, whereas our retain loss regularizes the updated model’s activations on the subject-specific sets (Appendix A.4 and A.5; solid blue) to be roughly similar to the frozen model’s activations on the subject-specific (dashed blue), suggesting that RMU preserves knowledge on benign data.
B.6 Generalization of RMU
We evaluate whether RMU prevents finetuning from recovering hazardous knowledge. Our work focuses on the closed-source threat model where LLM providers apply unlearning before LLM serving (Figure 2). We now consider the open-source threat model where LLM providers publicly release the LLM weights. In this setting, adversaries may finetune the model to attempt to recover hazardous capabilities.
We examine if RMU also prevents models from relearning unlearned knowledge through finetuning. In particular, we perform unlearning on Mistral-7B-v0.1 [Mistral AI team, 2023] and afterwards finetune on the cybersecurity forget corpus. In practice, we find it difficult to finetune zephyr-7b on our unlabeled corpus due to its instruction-tuning, so we use its base model, Mistral-7B-v0.1.
We finetune until the loss remains steady and report the results of finetuning in Figure 15. We see that RMU is unable to prevent finetuning from recovering performance, and we encourage future work to tackle the challenge of preventing relearning of unlearned knowledge through finetuning.
B.7 Baselines
We describe the baselines we employed, and any implementational details we employed for unlearning on RMU.
B.7.1 LLMU
We make several changes in adapting LLMU [Yao et al., 2023b] to our setting. We use bfloat16 for all floating point computations. In the unlearning process we do not stop after a prescribed maximum forget loss, rather stopping after unlearning for exactly a prescribed number of steps. Each sample of our dataset is truncated to 200 characters, and in the random loss we remove the question answer formatting, as our corpora does not follow this format. Using the hyperparameters for Llama 2 (7B) as a starting point, we employ low-rank adaptation [Hu et al., 2021], a batch size of 2, a random weight of 1, and a normal weight of 1. We apply a grid search over the learning rates , the number of steps , and the forget weight .
B.7.2 SCRUB
Kurmanji et al. [2023] propose SCalable Remembering and Unlearning unBound (SCRUB) for image classification. It uses the original model as a frozen teacher and clones it to form a student model that is adapted for unlearning. SCRUB cycles between forget data and retain data epochs, maximizing KL divergence of logits between the student and teacher model on the forget set, and minimizing it on the retain set. The retain set epochs also includes a task-specific loss with gold labels to maintain performance. We use the same forget set and retain sets as the RMU experiments, and with log perplexity on Wikitext as the task-specific loss. We tune the hyperparameter at values , to search over loss weightings between knowledge distillation and the task-specific loss. We do this as a grid search with learning rates being . We use unlearning steps in total, doing the forget step only for as it is recommended in Kurmanji et al. [2023] to stop it earlier. In the high learning rate case, i.e. we also try doing only unlearning steps in total, with only forget steps. Other than that, we use the same hyperparameters as those reported for LLMU above. Goel et al. [2024] have shown that SCRUB performs poorly when most training samples relevant to removal are not available. This could be one of the reasons why SCRUB performs poorly in our setting.
B.7.3 SSD
Selective Synaptic Dampening (SSD) [Foster et al., 2024] belongs to a class of methods which find parameters in the model that are differentially more important for the forget set than the retain set. While the method was originally developed for image classification, we adapt it for autoregressive language modeling by altering the loss function to log-perplexity on the forget set and retain set. We grid-search on the threshold and constant for dampening , the two main hyperparameters for SSD. We converged on these ranges after initial manual hyperparameter exploration for our task and datasets.
B.7.4 RMU
We perform a hyperparameter search over the layer to perform the unlearning loss on, starting from the third layer and going to the last layer. We perform a grid search on the number of training batches (i.e., number of gradient updates) in the range of . We choose early layers for unlearning ( for zephyr-7b and Mixtral-8x7B, and for Yi-34b). We also tune the weight of the retain loss, setting it to be for zephyr-7b, for Yi-34b, and for Mixtral-8x7B. We set the unlearning coefficient to be , and respectively. We focus unlearning only on the MLPs, as those encode knowledge in the model.
Appendix C MMLU Subset Unlearning Benchmark
To enable further research on unlearning, we provide auxiliary benchmarks via unlearning certain subsets of MMLU, while retaining performance on the remainder of MMLU.
We offer three settings:
-
•
Economics: Unlearning on high school macroeconomics and high school microeconomics while retaining all other categories of MMLU.
-
•
Law: Unlearning on international law and professional law while retaining all other categories of MMLU.
-
•
Physics: Unlearning on high school physics, conceptual physics, and college physics while retaining all other categories of MMLU.
We specifically chose these settings to forget topics that were relatively separate from the remainder of MMLU, and contained a large enough sample size of forget set questions to benchmark on (more than questions).
We publicly release forget set corpora for all three of these settings. For each subject, a selection of textbooks with Creative Commons licenses were identified (ranging from high-school to graduate level). The text from these books was extracted and filtered to a set of paragraph-length chunks. The beginnings and end matter (table of contents, acknowledgements, index, etc.) of each book were excluded, as were most equations and exercises. Additional cleaning was performed to remove citations, links, and other artifacts.
Table 3 demonstrates the results of RMU unlearning for each setting. In the forget column, we report the accuracy for each setting, aggregated across all topics within the setting. For the retain column, we include closely related MMLU categories that should not be unlearned – College Mathematics and High School Mathematics for Physics, Jurisprudence for Law, and Econometrics for Economics. Lastly, we also report the aggregate MMLU performance before and after RMU unlearning.
Unlearning on Physics results in a significant performance drop on College Physics and High School Physics, and in a small variation on MMLU and Math related areas scores. Similar considerations hold for the forget, retain and MMLU performance after unlearning on Economics. However, we observe significant degradation in the Retain set performance while unlearning on Law, demonstrating the potential for future methods to improve unlearning precision.
Category | Forget | Retain | MMLU (Full) | |||
---|---|---|---|---|---|---|
Base | RMU | Base | RMU | Base | RMU | |
Physics | 38.8 | 27.0 | 34.6 | 29.2 | 58.6 | 57.1 |
Law | 56.7 | 27.8 | 71.3 | 37.0 | 58.6 | 54.5 |
Economics | 60.2 | 27.3 | 45.6 | 41.2 | 58.6 | 55.0 |
Appendix D Broader Impacts of WMDP
We reflect on how WMDP comports with the broader landscape of risk mitigation strategies.
From a policy-making perspective, we hope that WMDP guides the evaluation of hazards posed by ML systems, such as by informing the National Institutes of Standards and Technology’s AI Risk Management Framework [NIST, 2023, White House, 2023] or other frameworks. Moreover, WMDP may serve as risk marker for more stringent policy action. For example, a model scoring above a particular threshold on WMDP could be flagged for more comprehensive evaluation, such as human red teaming with biosecurity experts.
Furthermore, unlearning with WMDP may reduce general-purpose capabilities of models in biology or cybersecurity, which could hamper their utility for defensive, or beneficial, applications in those areas. Therefore, unlearning should be complemented with other safety interventions, such as structured access (Section 6.2). This is especially important for cybersecurity, as most cybersecurity knowledge may be used for both offensive and defensive purposes. For instance, AI progress could significantly enhance anomaly detection capabilities. This could aid attackers in disguising their activities to mimic normal usage patterns, but also inform critical infrastructure providers of atypical behavior that could signify an attack.
In biosecurity, however, there exist categories of primarily offensive knowledge that may be unlearned without significant degradation to defensive capabilities. For instance, knowledge of historical bioweapons programs may be safely removed from models without significantly affecting knowledge related to countermeasure development or general-purpose biology. As a result, while both WMDP-Bio and WMDP-Cyber are both useful measurements of hazardous language model capabilities, WMDP-Bio may be the most useful tool for risk mitigation via unlearning.
More broadly, there are other strategies, including non-technical strategies, that could be pursued to mitigate malicious use – such as implementing universal screening of synthetic DNA orders to prevent the widespread access to pathogen DNA, addressing gaps in the regulation of Select Agents in the Federal Select Agent Program, and improving oversight of laboratory automation and outsourcing.
D.1 Limitations
WMDP consists of four-way multiple choice questions, potentially neglecting hazards that only surface in larger end-to-end evaluations. For instance, models that have memorized key biological concepts from the training data may be equally likely to do well on a particular multiple choice question as are models that have a true understanding of the underlying concept. Memorized facts may be particularly over-represented in our biological benchmark since many questions that were developed were drawn from open-access papers that were likely also included in the model’s training data. In addition, multiple choice questions only test for whether the model retains hazardous knowledge; these questions do not test whether the model will reveal that information to the end-user in a helpful and timely manner during the planning or execution of a nefarious attack. To address these limitations, future work in this area could include generating questions from scientific papers that were only released after a model’s training date cutoff, or using other strategies to generate questions which are difficult to search [Rein et al., 2023, Lála et al., 2023].
WMDP is a static benchmark which cannot anticipate the evolving landscape of cyber and biological risks, as threats continuously change and new technologies emerge. Moreover, as with any metric, scores on WMDP do not capture the full extent of malicious use risk. As a result, benchmarking on only WMDP may yield a false sense of model safety after unlearning. This limitation emphasizes the need for other safety benchmarks to complement WMDP, especially as new risks emerge over time. For instance, benchmarks that assess open-ended conversations may be a more promising method to assess capabilities of future models.
WMDP focuses on reducing risk for API-access models (Section 1); for models with publicly downloadable weights, unlearned information can be trivially re-introduced by malicious actors [Lynch et al., 2024]. If open-source models reach similar capabilities to closed-source models in the future, these risks will remain unaddressed by this work.
Appendix E X-Risk Sheet
We provide an analysis of how our paper contributes to reducing existential risk from AI, following the framework suggested by Hendrycks and Mazeika [2022]. Individual question responses do not decisively imply relevance or irrelevance to existential risk reduction.
E.1 Long-Term Impact on Advanced AI Systems
In this section, please analyze how this work shapes the process that will lead to advanced AI systems and how it steers the process in a safer direction.
Answer: This work aims to mitigate existential risks posed by the malicious use of LLMs in developing bioweapons and cyber weapons. WMDP serves both as a metric for evaluating the presence of hazardous knowledge, and as a benchmark for testing unlearning methods. We aim to reduce biological malicious use, as the proliferation of bioweapons could increase the risk of a catastrophic pandemic, potentially causing civilizational collapse [Gopal et al., 2023].
Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? Answer: WMDP increases the barrier of entry for malicious actors to cause catastrophic harm. It decreases access to models with hazardous biological or cyber capabilities, reducing the number of malicious actors with the skill and access to engineer pandemics or launch cyberattacks on critical infrastructure (Section 3).
Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Unlearning on WMDP reduces the risks of language model aided cyberattacks, particularly from low-skilled malicious actors. Cyberattacks, particularly on critical infrastructure, could be catastrophic. They are a diffuse contributor to economic turbulence and political instability [Forum, 2024], which may increase the risk of great power conflict, which in turn would likely increase the probability of an existential catastrophe. Unlearning may be applied to prevent other hazardous properties of ML models, such as situational awareness.
What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction be highly beneficial? Answer: This directly reduces x-risks associated with the malicious use of language models in developing weapons of mass destruction [Guembe et al., 2022, Gopal et al., 2023, OpenAI, 2024].
Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? 6. Problem Difficulty. Is it implausible that any practical system could ever markedly outperform humans at this task? 7. Human Unreliability. Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? 8. Competitive Pressures. Does work towards this approach strongly trade off against raw intelligence, other general capabilities, or economic utility?
E.2 Safety-Capabilities Balance
In this section, please analyze how this work relates to general capabilities and how it affects the balance between safety and hazards from general capabilities.
-
9.
Overview. How does this improve safety more than it improves general capabilities?
Answer: Unlearning does not improve general capabilities; rather, it removes specific model capabilities while improving inherent model safety.
Red Teaming. What is a way in which this hastens general capabilities or the onset of x-risks? Answer: Although WMDP is constructed as a benchmark for measuring and reducing inherent model hazards, it may inadvertently serve as a roadmap for malicious use, hastening the onset of x-risks by lowering the barrier for causing catastrophe. To reduce these risks, we conduct an extensive sensitive information mitigation process (Section 3.5).
General Tasks. Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research?
General Goals. Does this improve or facilitate research towards general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, or similar capabilities?
Correlation with General Aptitude. Is the analyzed capability known to be highly predicted by general cognitive ability or educational attainment?
Safety via Capabilities. Does this advance safety along with, or as a consequence of, advancing other capabilities or the study of AI?
E.3 Elaborations and Other Considerations
-
15.
Other. What clarifications or uncertainties about this work and x-risk are worth mentioning?
Answer: While unlearning is an important intervention for reducing model hazards, unlearning with may reduce the defensive, or beneficial, applications in those areas. unlearning should be complemented with other interventions that reduce risk (Appendix D).