Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training

Zhang, Haofei; Duan, Jiarui; Xue, Mengqi; Song, Jie; Sun, Li; Song, Mingli

arXiv:2112.03552 (cs)

[Submitted on 7 Dec 2021 (v1), last revised 26 Mar 2022 (this version, v4)]

Abstract: Recently, vision Transformers (ViTs) have been developing rapidly and are starting to challenge the dominance of convolutional neural networks (CNNs) in computer vision (CV). With the general-purpose Transformer architecture replacing the hard-coded inductive biases of convolution, ViTs have surpassed CNNs, especially when training data is plentiful. However, ViTs are prone to over-fitting on small datasets and thus rely on large-scale pre-training, which consumes enormous time. In this paper, we strive to liberate ViTs from pre-training by introducing CNNs' inductive biases back into ViTs while preserving their network architectures for a higher upper bound, and by setting up more suitable optimization objectives. To begin with, an agent CNN with inductive biases is designed based on the given ViT. Then a bootstrapping training algorithm is proposed to jointly optimize the agent and the ViT with weight sharing, during which the ViT learns inductive biases from the intermediate features of the agent. Extensive experiments on CIFAR-10/100 and ImageNet-1k with limited training data show encouraging results: the inductive biases help ViTs converge significantly faster and outperform conventional CNNs with even fewer parameters. Our code is publicly available.
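As a rough illustration of the joint optimization described in the abstract, the sketch below combines a classification loss for the ViT, a classification loss for the agent CNN, and a term that pulls the ViT's intermediate features toward the agent's. The abstract does not give the actual objective, so everything here is an assumption: the function names, the use of cross-entropy and mean squared error, and the `align_weight` parameter are hypothetical. Weight sharing between the two networks is not modeled.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (logits: list of floats)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def feature_mse(vit_feats, agent_feats):
    """Mean squared error between two equally sized feature vectors."""
    assert len(vit_feats) == len(agent_feats)
    return sum((v - a) ** 2 for v, a in zip(vit_feats, agent_feats)) / len(vit_feats)

def joint_loss(vit_logits, agent_logits, vit_feats, agent_feats, label,
               align_weight=1.0):
    """Hypothetical joint objective (an assumption, not the paper's loss).

    vit_feats and agent_feats are lists of per-layer feature vectors;
    align_weight scales the feature-alignment term.
    """
    loss_vit = cross_entropy(vit_logits, label)      # ViT classification
    loss_agent = cross_entropy(agent_logits, label)  # agent CNN classification
    # Alignment: sum of per-layer distances between ViT and agent features.
    loss_align = sum(feature_mse(v, a) for v, a in zip(vit_feats, agent_feats))
    return loss_vit + loss_agent + align_weight * loss_align
```

In a real training loop these quantities would be tensors produced by the two networks, and the gradient of the summed loss would update the shared weights once per step.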

Comments:	Accepted as a conference paper by CVPR2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2112.03552 [cs.CV]
	(or arXiv:2112.03552v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2112.03552

Submission history

From: Haofei Zhang
[v1] Tue, 7 Dec 2021 07:56:50 UTC (1,594 KB)
[v2] Thu, 9 Dec 2021 16:28:53 UTC (1,594 KB)
[v3] Wed, 23 Mar 2022 07:39:04 UTC (2,948 KB)
[v4] Sat, 26 Mar 2022 03:59:50 UTC (2,953 KB)
