VeCLIP: Improving CLIP Training via Visual-enriched Captions

Lai, Zhengfeng; Zhang, Haotian; Zhang, Bowen; Wu, Wentao; Bai, Haoping; Timofeev, Aleksei; Du, Xianzhi; Gan, Zhe; Shan, Jiulong; Chuah, Chen-Nee; Yang, Yinfei; Cao, Meng


arXiv:2310.07699 (cs)

[Submitted on 11 Oct 2023 (v1), last revised 13 Mar 2024 (this version, v3)]



Abstract: Large-scale web-crawled datasets are fundamental to the success of pre-training vision-language models such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts make precise image-text alignment difficult. Existing methods that use large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M. This study introduces a scalable pipeline for rewriting noisy captions. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the use of AltTexts alongside the newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Using this cost-effective pipeline, we scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to a +25.2% gain in COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, VeCLIP achieves a +3% gain while using only 14% of the data employed in the vanilla CLIP and 11% of that in ALIGN. We also note that the VeCap data is complementary to other well-curated datasets that are good for zero-shot classification tasks. When combining VeCap and DFN, our model achieves strong performance on both image-text retrieval and zero-shot classification tasks, e.g., 83.1% accuracy@1 on ImageNet zero-shot for an H/14 model. We release the pre-trained models at this https URL.
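
The mixed training scheme mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as one plausible reading, that each training sample carries both its original AltText and a generated VeCap caption, and that one of the two is drawn at random per sample. The function name `mix_captions`, the `p_vecap` parameter, and the sample tuple layout are hypothetical and are not taken from the paper or the released models.

```python
import random
from typing import Iterable, Iterator, Tuple

# Hypothetical sample layout: (image_id, alt_text, vecap_caption).
Sample = Tuple[str, str, str]


def mix_captions(
    samples: Iterable[Sample],
    p_vecap: float = 0.5,  # assumed mixing probability, not a value from the paper
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (image_id, caption) pairs, picking VeCap or AltText per sample.

    With probability p_vecap the visual-enriched caption is used; otherwise
    the original AltText is kept. Samples without a VeCap caption always
    fall back to their AltText.
    """
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    for image_id, alt_text, vecap in samples:
        use_vecap = bool(vecap) and rng.random() < p_vecap
        yield image_id, (vecap if use_vecap else alt_text)
```

Drawing the caption anew each time the data is iterated (for example with a different seed per epoch) would expose the text encoder to both caption styles for the same image, which is one way to read the abstract's goal of preserving data diversity.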

Comments:	CV/ML
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2310.07699 [cs.CV]
	(or arXiv:2310.07699v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.07699

Submission history

From: Zhengfeng Lai
[v1] Wed, 11 Oct 2023 17:49:13 UTC (26,653 KB)
[v2] Thu, 7 Mar 2024 18:25:39 UTC (27,562 KB)
[v3] Wed, 13 Mar 2024 22:27:08 UTC (27,562 KB)
