Tamper-Resistant Safeguards for Open-Weight LLMs

Tamirisa, Rishub; Bharathi, Bhrugu; Phan, Long; Zhou, Andy; Gatti, Alice; Suresh, Tarun; Lin, Maxwell; Wang, Justin; Wang, Rowan; Arel, Ron; Zou, Andy; Song, Dawn; Li, Bo; Hendrycks, Dan; Mazeika, Mantas

arXiv:2408.00761 (cs)

[Submitted on 1 Aug 2024 (v1), last revised 14 Sep 2024 (this version, v3)]

Abstract: Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. We develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after thousands of steps of fine-tuning. In extensive evaluations and red teaming analyses, we find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem, opening up a promising new avenue to improve the safety and security of open-weight LLMs.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2408.00761 [cs.LG]
	(or arXiv:2408.00761v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2408.00761

Submission history

From: Mantas Mazeika
[v1] Thu, 1 Aug 2024 17:59:12 UTC (1,527 KB)
[v2] Thu, 8 Aug 2024 22:46:04 UTC (576 KB)
[v3] Sat, 14 Sep 2024 02:43:00 UTC (576 KB)
