Localizing Moments in Long Video Via Multimodal Guidance

Barrios, Wayner; Soldan, Mattia; Ceballos-Arroyo, Alberto Mario; Heilbron, Fabian Caba; Ghanem, Bernard

arXiv:2302.13372 (cs)

[Submitted on 26 Feb 2023 (v1), last revised 15 Oct 2023 (this version, v2)]

Abstract: The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: this https URL.
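
The two-stage idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the optional pruning threshold, and the choice to combine the two scores by multiplication are all assumptions made for illustration. It only shows how a per-window guidance score might down-weight or prune windows before the base grounding model's proposals are ranked.

```python
from typing import List, Tuple

# (start_sec, end_sec, grounding_score) produced by a base grounding model.
Proposal = Tuple[float, float, float]


def sliding_windows(duration: float, window: float, stride: float) -> List[Tuple[float, float]]:
    """Split a video of `duration` seconds into short, possibly overlapping windows."""
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + window, duration)))
        start += stride
    return windows


def guided_ranking(
    proposals_per_window: List[List[Proposal]],
    guidance_scores: List[float],
    threshold: float = 0.0,
) -> List[Proposal]:
    """Rank proposals after re-weighting them by their window's guidance score.

    Assumption: the guidance score is a value in [0, 1] estimating how
    "describable" a window is, and the combined score is a simple product.
    Windows scoring below `threshold` are pruned entirely.
    """
    ranked: List[Proposal] = []
    for proposals, guidance in zip(proposals_per_window, guidance_scores):
        if guidance < threshold:
            continue  # prune windows judged non-describable
        for start, end, score in proposals:
            ranked.append((start, end, score * guidance))
    # Highest combined score first.
    ranked.sort(key=lambda p: p[2], reverse=True)
    return ranked


if __name__ == "__main__":
    # Hypothetical scores for two windows of a long video.
    proposals = [[(0.0, 2.0, 0.9)], [(5.0, 6.0, 0.6)]]
    guidance = [0.1, 0.8]
    print(guided_ranking(proposals, guidance, threshold=0.05))
```

In this toy example the second proposal ranks first despite its lower grounding score, because its window receives a much higher guidance score. A query-agnostic guidance model would compute one score per window and reuse it for every query, whereas a query-dependent one would recompute the score for each query.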

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2302.13372 [cs.CV]
	(or arXiv:2302.13372v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2302.13372
Journal reference:	Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2023

Submission history

From: Wayner Barrios
[v1] Sun, 26 Feb 2023 18:19:24 UTC (3,910 KB)
[v2] Sun, 15 Oct 2023 13:48:59 UTC (7,422 KB)
