Weight subcloning: direct initialization of transformers using larger pretrained ones

Samragh, Mohammad; Farajtabar, Mehrdad; Mehta, Sachin; Vemulapalli, Raviteja; Faghri, Fartash; Naik, Devang; Tuzel, Oncel; Rastegari, Mohammad

arXiv:2312.09299 (cs)

[Submitted on 14 Dec 2023]

Abstract: Training large transformer models from scratch for a target task requires large amounts of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with the weights of a pretrained model of the same size and specification, which speeds up convergence and training. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach, called weight subcloning, expedites the training of scaled-down transformers by initializing their weights from larger pretrained models.
Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps. First, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which trains significantly faster than one initialized randomly. For instance, we achieve 4x faster training for vision transformers in image classification and for language models designed for next-token prediction.
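
The two steps named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract does not say how importance is scored or which blocks are removed, so the toy block layout (`w_in`, `w_out`), the summed-absolute-weight score, the single index set shared across layers, and the evenly spaced choice of retained blocks are all assumptions made only for illustration.

```python
import numpy as np

def rank_neurons(weight_mats, keep):
    """Return the sorted indices of the `keep` highest-scoring embedding dimensions.

    Assumption: a dimension's score is the summed absolute weight in its row
    (axis 0 is the embedding axis). The paper's actual criterion may differ.
    """
    score = sum(np.abs(w).sum(axis=1) for w in weight_mats)
    top = np.argsort(score)[::-1][:keep]
    return np.sort(top)

def subclone(blocks, keep_dim, keep_layers):
    """Build a smaller toy model from a list of blocks.

    Each block is a hypothetical dict with 'w_in' of shape (d, h) and
    'w_out' of shape (h, d), where d is the embedding dimension.
    """
    # Step 1: pick which embedding dimensions to keep. One index set is
    # shared by all layers here (an assumption of this sketch).
    idx = rank_neurons([b["w_in"] for b in blocks], keep_dim)

    # Step 2: drop blocks to reach the target depth. Keeping evenly spaced
    # blocks is an assumption, not a rule taken from the abstract.
    layer_ids = np.linspace(0, len(blocks) - 1, keep_layers).round().astype(int)

    small = [
        {"w_in": blocks[i]["w_in"][idx, :], "w_out": blocks[i]["w_out"][:, idx]}
        for i in layer_ids
    ]
    return small, idx, layer_ids

# Toy usage: shrink a 6-block model with d=8 to 3 blocks with d=4.
rng = np.random.default_rng(0)
big = [{"w_in": rng.normal(size=(8, 16)), "w_out": rng.normal(size=(16, 8))}
       for _ in range(6)]
small, idx, layer_ids = subclone(big, keep_dim=4, keep_layers=3)
```

The returned `small` list would then serve as the initialization for ordinary training of the scaled-down network, in place of random weights.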

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.09299 [cs.LG]
	(or arXiv:2312.09299v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2312.09299

Submission history

From: Mohammad Samragh
[v1] Thu, 14 Dec 2023 19:08:56 UTC (1,257 KB)