SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models

Xia, Haotian; Yang, Zhengbang; Zou, Junbo; Tracy, Rhys; Wang, Yuqing; Lu, Chi; Lai, Christopher; He, Yanjun; Shao, Xun; Xie, Zhuoqing; Wang, Yuan-fang; Shen, Weining; Chen, Hanjie

arXiv:2410.08474 (cs)

[Submitted on 11 Oct 2024 (v1), last revised 4 Dec 2024 (this version, v3)]

Abstract: Multimodal Large Language Models (MLLMs) are advancing the ability to reason about complex sports scenarios by integrating textual and visual information. To comprehensively evaluate their capabilities, we introduce SPORTU, a benchmark designed to assess MLLMs across multi-level sports reasoning tasks. SPORTU comprises two key components. SPORTU-text features 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding; it tests models' ability to reason about sports solely through question-answering (QA), without requiring visual inputs. SPORTU-video consists of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning, from simple sports recognition to complex tasks like foul detection and rule application. We evaluate four prevalent LLMs on SPORTU-text, mainly using few-shot learning supplemented by chain-of-thought (CoT) prompting. GPT-4o achieves the highest accuracy of 71%, but still falls short of human-level performance, highlighting room for improvement in rule comprehension and reasoning. The evaluation of SPORTU-video covers 7 proprietary and 6 open-source MLLMs. Experiments show that models fall short on hard tasks that require deep reasoning and rule-based understanding. Claude-3.5-Sonnet performs best, with only 52.6% accuracy on the hard task, showing large room for improvement. We hope that SPORTU will serve as a critical step toward evaluating models' capabilities in sports understanding and reasoning.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2410.08474 [cs.CV]
	(or arXiv:2410.08474v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.08474

Submission history

From: Haotian Xia
[v1] Fri, 11 Oct 2024 02:58:38 UTC (38,058 KB)
[v2] Sat, 19 Oct 2024 08:17:17 UTC (38,058 KB)
[v3] Wed, 4 Dec 2024 00:43:57 UTC (42,501 KB)
