Finding Moments in Video Collections Using Natural Language

Escorcia, Victor; Soldan, Mattia; Sivic, Josef; Ghanem, Bernard; Russell, Bryan

arXiv:1907.12763 (cs)

[Submitted on 30 Jul 2019 (v1), last revised 23 Feb 2022 (this version, v2)]

Abstract: We introduce the task of retrieving relevant video moments from a large corpus of untrimmed, unsegmented videos given a natural language query. Our task poses unique challenges as a system must efficiently identify both the relevant videos and localize the relevant moments in the videos. To address these challenges, we propose SpatioTemporal Alignment with Language (STAL), a model that represents a video moment as a set of regions within a series of short video clips and aligns a natural language query to the moment's regions. Our alignment cost compares variable-length language and video features using symmetric squared Chamfer distance, which allows for efficient indexing and retrieval of the video moments. Moreover, aligning language features to regions within a video moment allows for finer alignment compared to methods that extract only an aggregate feature from the entire video moment. We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting. We show that our STAL re-ranking model outperforms the recently proposed Moment Context Network on all criteria across all datasets on our proposed task, obtaining relative gains of 37% to 118% for average recall and up to 30% for median rank. Moreover, our approach achieves more than 130x faster retrieval and 8x smaller index size with a 1M video corpus in an approximate setting.
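The alignment cost named in the abstract, a symmetric squared Chamfer distance between a variable-length set of language features and a variable-length set of video region features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D features, and the choice to average each one-sided term over its set size are assumptions made for illustration, and the paper's exact normalization may differ.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_l2(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def one_sided_chamfer(src: Sequence[Vector], dst: Sequence[Vector]) -> float:
    """Mean, over src, of the squared distance to the nearest vector in dst."""
    return sum(min(squared_l2(s, d) for d in dst) for s in src) / len(src)


def symmetric_squared_chamfer(
    language: Sequence[Vector], video: Sequence[Vector]
) -> float:
    """Sum of both one-sided terms (assumed normalization: mean per set)."""
    if not language or not video:
        raise ValueError("both feature sets must be non-empty")
    return one_sided_chamfer(language, video) + one_sided_chamfer(video, language)


# Hypothetical toy features: 2 word embeddings and 3 region embeddings.
query_words = [[0.0, 0.0], [1.0, 0.0]]
moment_regions = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

distance = symmetric_squared_chamfer(query_words, moment_regions)
print(distance)
```

In this toy case every word has an exact match among the regions, so the language-to-video term is zero; only the unmatched region [1.0, 1.0] contributes, through its squared distance of 1 to the nearest word, averaged over the three regions. Because each term reduces to nearest-neighbor lookups under squared Euclidean distance, a cost of this form is compatible with standard nearest-neighbor indexing, which is consistent with the abstract's remark about efficient indexing and retrieval.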

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:1907.12763 [cs.CV]
	(or arXiv:1907.12763v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1907.12763

Submission history

From: Mattia Soldan
[v1] Tue, 30 Jul 2019 07:31:02 UTC (3,701 KB)
[v2] Wed, 23 Feb 2022 12:44:54 UTC (2,548 KB)
