Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Lai, Zhengfeng; Saveris, Vasileios; Chen, Chen; Chen, Hong-You; Zhang, Haotian; Zhang, Bowen; Tebar, Juan Lao; Hu, Wenze; Gan, Zhe; Grasch, Peter; Cao, Meng; Yang, Yinfei

arXiv:2410.02740 (cs)

[Submitted on 3 Oct 2024]

Abstract: Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. Using caption formats ranging from Short Synthetic Captions (SSC) to Dense Synthetic Captions (DSC+) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.

Comments:	CV/ML
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2410.02740 [cs.CV]
	(or arXiv:2410.02740v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.02740

Submission history

From: Zhengfeng Lai
[v1] Thu, 3 Oct 2024 17:54:52 UTC (18,107 KB)
