GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task

Ding, Ning; Tang, Yehui; Fu, Zhongqian; Xu, Chao; Han, Kai; Wang, Yunhe

arXiv:2306.00693 (cs)

[Submitted on 1 Jun 2023 (v1), last revised 27 Feb 2025 (this version, v3)]

Abstract: The upsurge in pre-trained large models started by ChatGPT has swept across the entire deep learning community. Such powerful models demonstrate advanced generative ability and multimodal understanding capability, and they have quickly set new state-of-the-art results on a variety of benchmarks. A pre-trained LLM usually plays the role of a universal AI model that can carry out various tasks such as article analysis and image comprehension. However, due to the prohibitively high memory and computational cost of deploying such a large model, conventional models (such as CNNs and ViTs) are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g., image classification) by taking advantage of off-the-shelf large pre-trained models. We present a new learning framework, dubbed GPT4Image, in which the knowledge of large pre-trained models is extracted to help CNNs and ViTs learn better representations and achieve higher performance. First, we curate a high-quality description set by prompting a multimodal LLM to generate descriptions for the training images. Then, these detailed descriptions are fed into a pre-trained encoder to extract text embeddings that encode the rich semantics of the images. During training, the text embeddings serve as an extra supervisory signal and are aligned with the image representations learned by the vision models. The alignment process helps vision models achieve better performance with the aid of pre-trained LLMs. We conduct extensive experiments to verify the effectiveness of the proposed algorithm on various visual perception tasks and for heterogeneous model architectures.
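
The alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which alignment objective GPT4Image uses, so everything below is an assumption made for illustration: a cosine-distance term between each image representation and its precomputed text embedding, added to the usual classification cross-entropy with a hypothetical weight `weight`. The function names are invented for this sketch, and the image features are assumed to be already projected to the same dimension as the text embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(image_feats, text_embeds):
    """Mean (1 - cosine similarity) over paired image/text vectors.

    ASSUMPTION: the paper's actual alignment objective may differ;
    cosine distance is used here only as a simple stand-in.
    """
    assert len(image_feats) == len(text_embeds)
    total = sum(1.0 - cosine_similarity(f, t)
                for f, t in zip(image_feats, text_embeds))
    return total / len(image_feats)


def cross_entropy(logits, label):
    """Numerically stable cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(logits, labels, image_feats, text_embeds, weight=0.5):
    """Classification loss plus a weighted alignment term.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    ce = sum(cross_entropy(z, y) for z, y in zip(logits, labels)) / len(labels)
    return ce + weight * alignment_loss(image_feats, text_embeds)
```

In such a setup the text embeddings would be computed once, offline, from the LLM-generated descriptions, so the large models are needed only to prepare the training signal and not at inference time.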

Comments:	GitHub: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.00693 [cs.CV]
	(or arXiv:2306.00693v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.00693

Submission history

From: Ning Ding
[v1] Thu, 1 Jun 2023 14:02:45 UTC (1,340 KB)
[v2] Wed, 7 Jun 2023 13:59:25 UTC (1,340 KB)
[v3] Thu, 27 Feb 2025 12:49:05 UTC (5,127 KB)
