BUST: Benchmark for the evaluation of detectors of LLM-Generated Text

Joseph Cornelius; Oscar Lithgow-Serrano; Sandra Mitrović; Ljiljana Dolamic; Fabio Rinaldi

doi:10.18653/v1/2024.naacl-long.444

Abstract

We introduce BUST, a comprehensive benchmark designed to evaluate detectors of texts generated by instruction-tuned large language models (LLMs). Unlike previous benchmarks, our focus lies on evaluating the performance of detector systems, acknowledging the inevitable influence of the underlying tasks and different LLM generators. Our benchmark dataset consists of 25K texts from humans and 7 LLMs responding to instructions across 10 tasks from 3 diverse sources. Using the benchmark, we evaluated 5 detectors and found substantial performance variance across tasks. A meta-analysis of the dataset characteristics was conducted to guide the examination of detector performance. The dataset was analyzed using diverse metrics assessing linguistic features like fluency and coherence, readability scores, and writer attitudes, such as emotions, convincingness, and persuasiveness. Features impacting detector performance were investigated with surrogate models, revealing that emotional content in texts improved the performance of some detectors, yet the most effective detector performed consistently irrespective of writers’ attitudes and text styles. Our approach focused on investigating relationships between the detectors’ performance and two key factors: text characteristics and LLM generators. We believe BUST will provide valuable insights into selecting detectors tailored to specific text styles and tasks and facilitate a more practical and in-depth investigation of detection systems for LLM-generated text.

Anthology ID: 2024.naacl-long.444
Volume: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 8029–8057
URL: https://aclanthology.org/2024.naacl-long.444
DOI: 10.18653/v1/2024.naacl-long.444
Cite (ACL): Joseph Cornelius, Oscar Lithgow-Serrano, Sandra Mitrovic, Ljiljana Dolamic, and Fabio Rinaldi. 2024. BUST: Benchmark for the evaluation of detectors of LLM-Generated Text. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8029–8057, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): BUST: Benchmark for the evaluation of detectors of LLM-Generated Text (Cornelius et al., NAACL 2024)
PDF: https://aclanthology.org/2024.naacl-long.444.pdf