Exploring the Design Space of Visual Context Representation in Video MLLMs

Du, Yifan; Huo, Yuqi; Zhou, Kun; Zhao, Zijia; Lu, Haoyu; Huang, Han; Zhao, Wayne Xin; Wang, Bingning; Chen, Weipeng; Wen, Ji-Rong

arXiv:2410.13694 (cs)

[Submitted on 17 Oct 2024]

Abstract: Video Multimodal Large Language Models (MLLMs) have shown remarkable capability in understanding video semantics across various downstream tasks. Despite these advances, systematic research on visual context representation, which refers to the scheme for selecting frames from a video and then selecting tokens from each frame, is still lacking. In this paper, we explore the design space for visual context representation and aim to improve the performance of video MLLMs by finding more effective representation schemes. First, we formulate visual context representation as a constrained optimization problem and model the language modeling loss as a function of the number of frames and the number of embeddings (or tokens) per frame, given the maximum visual context window size. Then, we explore the scaling effects of frame selection and token selection separately, and fit the corresponding function curves through extensive empirical experiments. We examine the effectiveness of typical selection strategies and present empirical findings for determining the two factors. Furthermore, we study the joint effect of frame selection and token selection and derive the optimal formula for determining the two factors. We demonstrate that the derived optimal settings align with the best-performing results of the empirical experiments. Our code and model are available at: this https URL.
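
The constrained optimization described in the abstract can be illustrated with a minimal sketch: choose the number of frames and the number of tokens per frame so that their product stays within the visual context window while a loss function is minimized. The abstract does not state the fitted loss function, so `toy_loss`, its power-law form, its constants, and the helper `best_allocation` are all hypothetical placeholders and not the authors' method or results.

```python
from typing import Callable, Tuple


def toy_loss(num_frames: int, tokens_per_frame: int) -> float:
    """Hypothetical loss surface (assumption, not from the paper).

    A simple power-law form that decreases as either factor grows.
    """
    return 1.0 + 2.0 / num_frames ** 0.5 + 3.0 / tokens_per_frame ** 0.5


def best_allocation(
    loss_fn: Callable[[int, int], float],
    max_visual_tokens: int,
) -> Tuple[int, int, float]:
    """Search (frames, tokens per frame) with frames * tokens <= budget.

    Assumes loss_fn never increases when tokens_per_frame grows, so for
    each frame count only the largest feasible token count is evaluated.
    """
    if max_visual_tokens < 1:
        raise ValueError("max_visual_tokens must be at least 1")
    best = None
    for frames in range(1, max_visual_tokens + 1):
        tokens = max_visual_tokens // frames  # largest value that fits the budget
        value = loss_fn(frames, tokens)
        if best is None or value < best[2]:
            best = (frames, tokens, value)
    return best


if __name__ == "__main__":
    # The budget of 6000 visual tokens is an arbitrary illustrative value.
    frames, tokens, value = best_allocation(toy_loss, 6000)
    print(f"frames={frames}, tokens_per_frame={tokens}, loss={value:.4f}")
```

Swapping `toy_loss` for a curve fitted to measured losses would turn the same search into a way of reading off the budget split that the fitted curve prefers. The exhaustive loop is only practical because the search space is one-dimensional once the budget is fixed.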

Comments:	Long Video MLLM; work in progress
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2410.13694 [cs.CV]
	(or arXiv:2410.13694v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.13694

Submission history

From: Yifan Du
[v1] Thu, 17 Oct 2024 15:59:52 UTC (1,371 KB)
