Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

Fan, Chongyu; Liu, Jiancheng; Lin, Licong; Jia, Jinghan; Zhang, Ruiqi; Mei, Song; Liu, Sijia

arXiv:2410.07163 (cs)

[Submitted on 9 Oct 2024 (v1), last revised 7 Feb 2025 (this version, v3)]


Abstract: This work studies the problem of large language model (LLM) unlearning, which aims to remove unwanted data influences (e.g., copyrighted or harmful content) while preserving model utility. Despite the increasing demand for unlearning, a technically grounded optimization framework is lacking. Gradient ascent (GA)-type methods, though widely used, are suboptimal because they reverse the learning process without controlling optimization divergence (i.e., deviation from the pre-trained state), which risks over-forgetting and potential model collapse. Negative preference optimization (NPO) has been proposed to address this issue and is considered one of the state-of-the-art LLM unlearning approaches. In this work, we revisit NPO and identify another critical issue: reference model bias. This bias arises from using the reference model (i.e., the model prior to unlearning) to evaluate unlearning success, which can compromise NPO's effectiveness. Specifically, it leads to (a) uneven allocation of optimization power across forget data with varying difficulty levels and (b) ineffective gradient weight smoothing during the early stages of unlearning optimization. To overcome these challenges, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that "simplicity" in removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We provide deeper insights into SimNPO's advantages through an analysis based on mixtures of Markov chains. Extensive experiments further validate SimNPO's efficacy on benchmarks such as TOFU and MUSE, as well as its robustness against relearning attacks. Code is available at this https URL.
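The abstract does not state the objectives themselves, so the following is only a minimal sketch of the contrast it describes, under stated assumptions. It assumes the commonly cited NPO form, which scores a forget sample by the log-ratio between the current model and the reference model, and a reference-free variant in the style of simple preference optimization that uses a length-normalized log-likelihood with a margin. The function names, the parameters `beta` and `gamma`, and the per-sample scalar inputs are illustrative choices, not the authors' code.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) for a scalar z.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def npo_loss(logp_theta, logp_ref, beta=0.1):
    # Assumed reference-based form: the loss depends on the log-ratio
    # between the current model and the reference (pre-unlearning) model.
    return -(2.0 / beta) * log_sigmoid(-beta * (logp_theta - logp_ref))


def simnpo_loss(logp_theta, length, beta=0.1, gamma=0.0):
    # Assumed reference-free form: the reference term is dropped and the
    # sequence log-likelihood is normalized by the response length.
    return -(2.0 / beta) * log_sigmoid(-(beta / length) * logp_theta - gamma)


# At the start of unlearning the current model equals the reference model,
# so the reference-based loss is identical for every forget sample.
easy = npo_loss(-5.0, -5.0)
hard = npo_loss(-50.0, -50.0)

# The reference-free loss still distinguishes samples by likelihood.
likely = simnpo_loss(-5.0, length=10)
unlikely = simnpo_loss(-50.0, length=10)
```

In this sketch, the reference-based loss cannot tell forget samples apart when the current model coincides with the reference model, while the reference-free loss is lower for a sample the model already assigns low likelihood. This is meant only to illustrate the kind of reference dependence the abstract calls reference model bias.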

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2410.07163 [cs.CL]
	(or arXiv:2410.07163v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.07163

Submission history

From: Chongyu Fan
[v1] Wed, 9 Oct 2024 17:58:12 UTC (5,134 KB)
[v2] Mon, 28 Oct 2024 19:55:24 UTC (5,134 KB)
[v3] Fri, 7 Feb 2025 18:34:28 UTC (5,375 KB)
