Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

He, Dongliang; Zhao, Xiang; Huang, Jizhou; Li, Fu; Liu, Xiao; Wen, Shilei

Computer Science > Computer Vision and Pattern Recognition

arXiv:1901.06829 (cs)

[Submitted on 21 Jan 2019]

Title:Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Authors:Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, Shilei Wen

View PDF

Abstract:The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate this task as a problem of sequential decision making by learning an agent which regulates the temporal grounding boundaries progressively based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning and it shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on ActivityNet'18 DenseCaption dataset and Charades-STA dataset while observing only 10 or less clips per video.

Comments:	AAAI 2019
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1901.06829 [cs.CV]
	(or arXiv:1901.06829v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1901.06829

Submission history

From: Dongliang He [view email]
[v1] Mon, 21 Jan 2019 09:00:11 UTC (632 KB)

Full-text links:

Access Paper:

view license

Current browse context:

cs.CV

< prev | next >

new | recent | 2019-01

Change to browse by:

References & Citations

DBLP - CS Bibliography

listing | bibtex

Dongliang He
Xiang Zhao
Jizhou Huang
Fu Li
Xiao Liu

…

export BibTeX citation

Computer Science > Computer Vision and Pattern Recognition

Title:Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators