Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Huang, Ziyuan; Ji, Kaixiang; Gong, Biao; Qing, Zhiwu; Zhang, Qinglong; Zheng, Kecheng; Wang, Jian; Chen, Jingdong; Yang, Ming

arXiv:2407.15819 (cs)

[Submitted on 22 Jul 2024]

Abstract: This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count after pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase than in the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-training acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performance, competitive with existing approaches on a series of benchmarks.
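The token arithmetic behind a compound scaling strategy can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: each scale splits the feature map into a grid of windows, each window is resampled into a fixed number of query tokens, and the total can be grown after pre-training both by adding windows and by adding queries per window. The function name, the grid sizes, and the query counts below are hypothetical values chosen only to show how two 4x factors compound to 16x; they are not taken from the paper.

```python
def visual_token_count(windows_per_side, queries_per_window):
    """Total visual tokens across scales.

    windows_per_side: one entry per scale; a value w means a w x w grid of windows.
    queries_per_window: tokens each window is resampled into (hypothetical knob).
    """
    return sum(w * w * queries_per_window for w in windows_per_side)


# Hypothetical pre-training setup: one global window plus a 2x2 grid of local windows.
pretrain_tokens = visual_token_count([1, 2], queries_per_window=16)

# Hypothetical fine-tuning setup: 4x more windows per scale and 4x more queries per window.
finetune_tokens = visual_token_count([2, 4], queries_per_window=64)

scale_factor = finetune_tokens // pretrain_tokens
print(pretrain_tokens, finetune_tokens, scale_factor)
```

Under these assumed settings, the pre-training phase processes a small fraction of the tokens used in fine-tuning, which is the source of the speed-up the abstract describes.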

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.15819 [cs.CV]
	(or arXiv:2407.15819v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.15819

Submission history

From: Ziyuan Huang
[v1] Mon, 22 Jul 2024 17:33:49 UTC (305 KB)