Edit As You Wish: Video Caption Editing with Multi-grained User Control

Yao, Linli; Zhang, Yuanmeng; Wang, Ziheng; Hou, Xinglin; Ge, Tiezheng; Jiang, Yuning; Sun, Xu; Jin, Qin

arXiv:2305.08389 (cs)

[Submitted on 15 May 2023 (v1), last revised 8 Aug 2024 (this version, v3)]

Abstract: Automatically narrating videos in natural language in accordance with user requests, i.e., the Controllable Video Captioning task, can help people manage massive video collections according to their intentions. However, existing works suffer from two shortcomings: 1) the control signal is single-grained, which cannot satisfy diverse user intentions; 2) the video description is generated in a single round and cannot be further edited to meet dynamic needs. In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained. To facilitate the VCE task, we automatically construct an open-domain benchmark dataset named VATEX-EDIT and manually collect an e-commerce dataset called EMMAD-EDIT. We further propose a specialized small-scale model (i.e., OPA) and compare it with two generalist Large Multi-modal Models to perform an exhaustive analysis of the novel task. For evaluation, we adopt comprehensive metrics covering caption fluency, command-caption consistency, and video-caption alignment. Experiments reveal that the task's main challenges lie in understanding and processing fine-grained multi-modal semantics. Our datasets, codes, and evaluation tools are available at this https URL.
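As a rough illustration of the {operation, position, attribute} command described in the abstract, the triplet could be held in a small record type. This is only a sketch under stated assumptions: the class name `EditCommand`, the `granularity` helper, the example values ("add", a word index, "red"), and the idea that coarser commands leave slots unfilled are all hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditCommand:
    """Hypothetical container for one {operation, position, attribute} triplet."""

    operation: str                   # kind of revision requested, e.g. "add" (assumed value)
    position: Optional[int] = None   # where in the caption to edit (assumed: a word index)
    attribute: Optional[str] = None  # content the edit should involve (assumed: free text)

    def granularity(self) -> int:
        """Count the filled slots: 1 is the coarsest command, 3 the finest."""
        return 1 + (self.position is not None) + (self.attribute is not None)


# A coarse command names only the operation; a fine one fills all three slots.
coarse = EditCommand(operation="add")
fine = EditCommand(operation="add", position=3, attribute="red")
print(coarse.granularity(), fine.granularity())
```

Treating unfilled slots as `None` is one simple way to express the coarse-to-fine range in a single type; the actual command encoding used for VATEX-EDIT and EMMAD-EDIT may differ.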

Comments: Accepted by ACM MM 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as: arXiv:2305.08389 [cs.CV]
(or arXiv:2305.08389v3 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2305.08389

Submission history

From: Linli Yao
[v1] Mon, 15 May 2023 07:12:19 UTC (10,570 KB)
[v2] Mon, 3 Jun 2024 07:47:36 UTC (6,819 KB)
[v3] Thu, 8 Aug 2024 09:28:22 UTC (7,707 KB)
