Question-Answering Dense Video Events

Qin, Hangyu; Xiao, Junbin; Yao, Angela

doi:10.1145/3726302.3729945

arXiv:2409.04388 (cs)

[Submitted on 6 Sep 2024 (v1), last revised 16 May 2025 (this version, v5)]

Abstract: This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA, a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to, respectively, detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at this https URL.
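The abstract reports G(round)QA accuracy without defining it. As a rough illustration only, the sketch below assumes one common convention for grounded video QA: a question counts as a hit when the answer is correct and the predicted time span overlaps the annotated span with an intersection-over-prediction (IoP) of at least 0.5. The `Prediction` class, the function names, and the 0.5 threshold are all hypothetical choices for this sketch and are not taken from the paper or its code release.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One answered question (hypothetical container, not from the paper)."""
    answer_correct: bool            # did the model pick the right answer?
    pred_span: tuple[float, float]  # predicted (start, end) in seconds
    gt_span: tuple[float, float]    # annotated (start, end) in seconds


def intersection_over_prediction(pred, gt):
    """Fraction of the predicted span that lies inside the annotated span."""
    overlap = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    pred_len = pred[1] - pred[0]
    return overlap / pred_len if pred_len > 0 else 0.0


def grounded_qa_accuracy(preds, iop_threshold=0.5):
    """Share of questions that are both answered correctly and grounded.

    Assumption: grounding is judged by IoP >= iop_threshold.
    """
    if not preds:
        return 0.0
    hits = sum(
        1
        for p in preds
        if p.answer_correct
        and intersection_over_prediction(p.pred_span, p.gt_span) >= iop_threshold
    )
    return hits / len(preds)


examples = [
    Prediction(True, (0.0, 10.0), (0.0, 10.0)),    # correct and fully grounded
    Prediction(True, (0.0, 10.0), (20.0, 30.0)),   # correct, but wrong moment
    Prediction(False, (0.0, 10.0), (0.0, 10.0)),   # grounded, but wrong answer
    Prediction(True, (0.0, 10.0), (4.0, 20.0)),    # correct, IoP = 0.6
]
score = grounded_qa_accuracy(examples)
```

Under this assumed definition a model gets no credit for a correct answer tied to the wrong moment, which is why a grounded score can sit well below plain QA accuracy.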

Comments:	Accepted to SIGIR'25
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Cite as:	arXiv:2409.04388 [cs.CV]
	(or arXiv:2409.04388v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.04388
Related DOI:	https://doi.org/10.1145/3726302.3729945

Submission history

From: Hangyu Qin
[v1] Fri, 6 Sep 2024 16:27:52 UTC (24,881 KB)
[v2] Mon, 9 Sep 2024 13:15:41 UTC (24,870 KB)
[v3] Tue, 10 Sep 2024 09:46:58 UTC (24,870 KB)
[v4] Wed, 7 May 2025 14:35:23 UTC (13,025 KB)
[v5] Fri, 16 May 2025 08:24:31 UTC (13,138 KB)
