ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Yan, Yibo; Wang, Shen; Huo, Jiahao; Li, Hang; Li, Boyan; Su, Jiamin; Gao, Xiong; Zhang, Yi-Fan; Xu, Tianlong; Chu, Zhendong; Zhong, Aoxiao; Wang, Kun; Xiong, Hui; Yu, Philip S.; Hu, Xuming; Wen, Qingsong

Computer Science > Computation and Language

arXiv:2410.04509 (cs)

[Submitted on 6 Oct 2024 (v1), last revised 8 Oct 2024 (this version, v2)]

Title:ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Authors:Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, Yi-Fan Zhang, Tianlong Xu, Zhendong Chu, Aoxiao Zhong, Kun Wang, Hui Xiong, Philip S. Yu, Xuming Hu, Qingsong Wen

View PDF HTML (experimental)

Abstract:As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs' problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate the new task: multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with rigorous annotation and rich metadata such as problem type and error category. Through extensive experiments, we evaluated both open-source and closed-source representative MLLMs, benchmarking their performance against educational expert evaluators. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation. The dataset will be available upon acceptance.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2410.04509 [cs.CL]
	(or arXiv:2410.04509v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.04509

Submission history

From: Yibo Yan [view email]
[v1] Sun, 6 Oct 2024 14:59:09 UTC (4,809 KB)
[v2] Tue, 8 Oct 2024 06:03:46 UTC (4,809 KB)

Computer Science > Computation and Language

Title:ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators