JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models

Zhang, Mi; Pan, Xudong; Yang, Min

Computer Science > Computation and Language

arXiv:2311.00286 (cs)

[Submitted on 1 Nov 2023 (v1), last revised 10 Dec 2023 (this version, v3)]

Abstract: In this paper, we present JADE, a targeted linguistic fuzzing platform that increases the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely used LLMs, categorized into three groups: eight open-sourced Chinese, six commercial Chinese, and four commercial English LLMs. JADE generates three safety benchmarks, one for each group of LLMs, containing unsafe questions that are highly threatening: the questions simultaneously trigger harmful generation in multiple LLMs, with an average unsafe generation ratio of $70\%$ (please see the table below), while remaining natural and fluent and preserving the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced English LLMs at the following link: this https URL. Readers interested in evaluating on more questions generated by JADE are welcome to contact us.
JADE is based on Noam Chomsky's seminal theory of transformational-generative grammar. Given a seed question with unsafe intention, JADE invokes a sequence of generative and transformational rules to increase the complexity of the question's syntactic structure until the safety guardrail is broken. Our key insight is that, due to the complexity of human language, most of the current best LLMs can hardly recognize the invariant evil across the infinite number of different syntactic structures, which form an unbounded example space that can never be fully covered. Technically, the generative and transformational rules are constructed by native speakers of the languages and, once developed, can be used to automatically grow and transform the parse tree of a given question until the guardrail is broken. For more evaluation results and a demo, please check our website: this https URL.
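
The idea of growing and transforming a parse tree with a sequence of rules can be illustrated with a minimal sketch. This is not JADE's implementation: the `Node` class, the three toy rules, the `mutate` loop, and the `should_stop` callback are all hypothetical names introduced here, the seed question is deliberately benign, and the stopping test is a simple depth check standing in for whatever evaluation the platform actually performs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A toy parse-tree node: a syntactic label with children, or a word leaf."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def words(self) -> List[str]:
        # Leaves carry the surface words; inner nodes concatenate their children.
        if not self.children:
            return [self.label]
        return [w for child in self.children for w in child.words()]

    def depth(self) -> int:
        # Crude proxy for syntactic complexity (an assumption of this sketch).
        return 1 + max((child.depth() for child in self.children), default=0)


def leaf(word: str) -> Node:
    return Node(word)


Rule = Callable[[Node], Node]


def embed_in_clause(tree: Node) -> Node:
    """Generative-style toy rule: embed the question under a matrix clause."""
    matrix = Node("S'", [leaf("could"), leaf("you"), leaf("tell"), leaf("me")])
    return Node("S", [matrix, tree])


def add_adjunct(tree: Node) -> Node:
    """Generative-style toy rule: attach an adverbial phrase to the root."""
    adjunct = Node("PP", [leaf("in"), leaf("simple"), leaf("terms")])
    return Node(tree.label, tree.children + [adjunct])


def front_last_phrase(tree: Node) -> Node:
    """Transformational-style toy rule: move the last constituent to the front."""
    if len(tree.children) < 2:
        return tree
    return Node(tree.label, [tree.children[-1]] + tree.children[:-1])


def mutate(seed: Node, rules: List[Rule],
           should_stop: Callable[[Node], bool], max_steps: int = 10) -> Node:
    """Apply rules in turn until should_stop(tree) holds or the budget runs out."""
    tree = seed
    for step in range(max_steps):
        if should_stop(tree):
            break
        tree = rules[step % len(rules)](tree)  # each rule returns a new tree
    return tree


RULES: List[Rule] = [embed_in_clause, add_adjunct, front_last_phrase]

# Benign seed question used purely for illustration.
seed = Node("S", [Node("WH", [leaf("how")]),
                  Node("VP", [leaf("does"), leaf("bread"), leaf("rise")])])

result = mutate(seed, RULES, should_stop=lambda t: t.depth() >= 5)
print(" ".join(result.words()))
print("depth:", seed.depth(), "->", result.depth())
```

Each rule returns a new tree rather than editing its input, so the seed stays available for comparison, and every original word survives in the output, mirroring the abstract's requirement that the core semantics be preserved. The toy rules make no attempt at grammatical output; the abstract states that the real rules are written by native speakers.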

Comments:	A preprint work. Benchmark link: this https URL. Website link: this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as:	arXiv:2311.00286 [cs.CL]
	(or arXiv:2311.00286v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.00286

Submission history

From: Xudong Pan
[v1] Wed, 1 Nov 2023 04:36:45 UTC (12,940 KB)
[v2] Thu, 2 Nov 2023 02:36:47 UTC (12,939 KB)
[v3] Sun, 10 Dec 2023 13:58:24 UTC (12,939 KB)
