
Published as a conference paper at ICLR 2023

LEARNING TO GROW PRETRAINED MODELS FOR EFFICIENT TRANSFORMER TRAINING

Peihao Wang¹∗ Rameswar Panda² Lucas Torroba Hennigen⁴ Philip Greengard³
Leonid Karlinsky² Rogerio Feris² David D. Cox² Zhangyang Wang¹ Yoon Kim⁴
¹University of Texas at Austin, ²MIT-IBM Watson AI Lab, ³Columbia University, ⁴MIT
{peihaowang,atlaswang}@utexas.edu, {rpanda, leonidka, david.d.cox}@ibm.com,
[email protected], [email protected], {lucastor, yoonkim}@mit.edu

ABSTRACT
Scaling transformers has led to significant breakthroughs in many domains, leading to a paradigm in which larger versions of existing models are trained and released on a periodic basis. New instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the parameters of smaller, extant models to enable faster training of newer, larger models? This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model. For tractable learning, we factorize the linear transformation as a composition of (linear) width- and depth-growth operators, and further employ a Kronecker factorization of these growth operators to encode architectural knowledge. Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch, while consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models.¹

1 INTRODUCTION
The transformer architecture (Vaswani et al., 2017) has emerged as a general purpose architecture
for modeling many structured domains (Devlin et al., 2019; Brown et al., 2020; Rives et al., 2021;
Dosovitskiy et al., 2021; Touvron et al., 2021a). Perhaps more so than other architectures, the
transformer empirically seems to have inductive biases that make it especially amenable to scaling
(Rosenfeld et al., 2019; Kaplan et al., 2020), which has led to a paradigm in which larger versions of
smaller, existing models are trained and released on a periodic basis (e.g., the GPT lineage of models
(Radford et al., 2018; 2019; Brown et al., 2020)). New instances of such models are typically trained
completely from scratch, despite the fact that they are often scaled-up versions of their smaller
counterparts. Given the compute required to train even the smaller models, we argue that training
each model from scratch is wasteful, and that prior knowledge implicit in the parameters of smaller
pretrained models should be leveraged to enable faster training of larger models.
One approach to this problem is through the lens of model growth, wherein a smaller model’s pre-
trained parameters are used to initialize a subset of the larger model’s parameters. While earlier
works generally froze the parameters initialized from the pretrained model and only trained the new
(randomly initialized) parameters (Fahlman & Lebiere, 1989; Fahlman, 1990; Gutstein et al., 2008),
subsequent work has shown that copying a subset of the pretrained parameters to initialize the new
parameters and then finetuning the entire network significantly accelerates training and sometimes
even leads to better performance (Chen et al., 2015). When applied to modern transformers, these
mechanisms roughly translate to a depth-expansion operator in which pretrained models are stacked
(or combined with identity layers) to initialize deeper transformers (Gong et al., 2019; Yang et al.,
2020), and a width-expansion operator in which the smaller model’s matrices are copied to initialize
the larger model’s matrices (e.g., in block-diagonal fashion) (Chen et al., 2021; Gu et al., 2020).


∗ Work done during an internship at MIT-IBM Watson AI Lab.
¹ Project page: https://vita-group.github.io/LiGO/


Figure 1: Our linear growth operator (LiGO) accelerates training by using the weights of a smaller model Θ
to initialize the weights of the larger model Θ(new) . LiGO is parameterized as a sparse linear map M that
can be decomposed into width- and depth-expansion operators. The width-operator Rwidth and depth-operator
Ldepth are structured matrices obtained from Kronecker products of smaller matrices which encode architec-
tural knowledge by grouping parameters into layers and neurons. While we show the expansion operators for
simple multi-layer perceptrons for illustrative purposes, in practice we apply LiGO to enable faster training of
transformer networks. In our approach, we learn the growth matrix M with 100 steps of SGD, use this to
initialize the larger model, and then continue training as usual. Best viewed in color.

Noting the empirical effectiveness of such recipes, we observe that existing mechanisms generally
do not have a learning component (e.g., randomly copying over neurons for width-expansion or
stacking consecutive layers for depth-expansion). This paper instead proposes an efficient, data-
driven approach for learning to grow transformers. In particular, our approach frames the problem
of initializing the larger model’s parameters as learning a linear mapping from the smaller model’s
parameters, i.e., Θ(large) = M Θ(small) where Θ(small) and Θ(large) are the vectorized parame-
ters of the small/large models. Due to the high dimensionality of the parameters, this mapping is
completely intractable to learn without any restrictions on M . We thus factorize the linear mapping
to be a composition of sparse width- and depth-expansion operators, M = Ldepth Rwidth , where
both width and depth matrices are further factorized to be a Kronecker product of smaller matri-
ces that express architectural knowledge (e.g., through grouping parameters by layers and neurons).
We show that our growth operators can represent existing approaches such as layer-stacking and
neuron-copying as special cases. We find that with a small amount of learning on M (e.g., 100
gradient steps) to initialize the larger model, we can significantly accelerate training of both vision
and language transformers. Figure 1 illustrates our approach.
We apply our learned linear growth operator (LiGO) to popular families of models—BERT (Devlin
et al., 2019), RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), and ViT (Dosovitskiy et al.,
2021; Touvron et al., 2021a;b)—and find that LiGO can consistently improve transformer training
efficiency over the traditional way of training from scratch across domains and model sizes. For
instance, LiGO saves 44.7% and 22.5% FLOPs for training BERT-Base and GPT2-Medium from
scratch by reusing pretrained smaller models that are half as big. Similarly, for vision transformers,
when using DeiT-S (Touvron et al., 2021a) for initialization, LiGO yields 55% savings in FLOPs
with no performance drop on ImageNet (Deng et al., 2009). These FLOPs savings directly trans-
late to similar wall clock savings. We further find that models trained using LiGO achieve similar
performance to the trained-from-scratch baselines when transferred to downstream tasks.

2 RELATED WORK
Efficient training. Efficient training of transformers has been studied from multiple perspectives.
Some methods that are orthogonal to our work include mixed precision training (Shoeybi et al.,
2019), large batch optimization (You et al., 2019), distributed training (Huang et al., 2019), and
dropping layers (Zhang & He, 2020) or tokens (Hou et al., 2022). Knowledge inheritance (Qin
et al., 2021) explores knowledge distillation during pretraining to efficiently learn larger transform-
ers. Progressive training, which first trains a small transformer with few layers and then gradually
expands by stacking layers, has also been applied to accelerate transformer training (Gong et al.,
2019; Yang et al., 2020; Li et al., 2022; Shen et al., 2022). Net2Net (Chen et al., 2015) uses function-
preserving transformations to grow width by copying neurons and depth by using identity layers.
Recently, bert2BERT (Chen et al., 2021) extends Net2Net to transformers. In contrast to these ap-
proaches, our approach learns to (linearly) transform the parameters of a smaller model to initialize a


larger model. While there is a line of work on learning to grow neural networks in a data-driven way,
these methods are in general difficult to apply to modern-scale transformers since they (for example)
involve growing a single neuron at a time or employ expensive optimization/search procedures (Wei
et al., 2016; Cai et al., 2018; Wu et al., 2019; 2021; Evci et al., 2022).
Network initialization. Our work is also related to work on neural network initialization. Exist-
ing works include controlling the norm of the parameters (Mishkin & Matas, 2015; Kilcher et al.,
2018; Dai et al., 2019; Wu et al., 2019; Glorot & Bengio, 2010) or replacing the normalization lay-
ers (Brock et al., 2021; Zhang et al., 2019; Huang et al., 2020). MetaInit (Dauphin & Schoenholz,
2019) proposes an automatic method that optimizes the norms of weight tensors to minimize the
gradient quotient on minibatches of random Gaussian samples. GradInit (Zhu et al., 2021) learns to
initialize larger networks by adjusting the norm of each layer. Our work focuses on using smaller pre-
trained transformers to better initialize larger transformers, which remains an understudied problem.
Structured matrices. Finally, our work is also related to structured matrices which are typically
used to replace dense weight matrices for reducing training and inference computation cost. Ex-
amples include sparse and low rank matrices (Chiu et al., 2021; Han et al., 2015), Chebyshev
matrices (Tang et al., 2019), Toeplitz matrices (Sindhwani et al., 2015), Kronecker-product ma-
trices (Zhang et al., 2015), and butterfly matrices (Dao et al., 2019). A unified framework to learn
a broad family of structured matrices is presented in Sindhwani et al. (2015). Dao et al. (2022)
propose Monarch matrices, which inherit the expressiveness of butterfly matrices and achieve rea-
sonable accuracy-efficiency tradeoffs in many applications. While our approach is inspired by these
works, we propose to grow pretrained models by learning structured sparse linear operators with
Kronecker factorization, which to our knowledge has not been explored in the literature.

3 PROPOSED APPROACH
Notation. We denote the parameters of a neural network with $L$ layers and $D$ dimensions as $\Theta_{L,D} = [W_1 \cdots W_L] \in \mathbb{R}^{LD \times D}$, where $W_l \in \mathbb{R}^{D \times D}$ denotes the weights for the $l$-th layer.² With slight abuse of notation, we denote the vectorization of $\Theta_{L,D}$ as $\mathrm{vec}(\Theta_{L,D})^\top = [\mathrm{vec}(W_1)^\top \cdots \mathrm{vec}(W_L)^\top]$.³ Our goal is to re-use the parameters $\Theta = \Theta_{L_1,D_1}$ from a pretrained smaller model to initialize a large model $\Theta^{(\mathrm{new})} = \Theta_{L_2,D_2}$ through a model growth operator $M : \mathbb{R}^{L_1 D_1 \times D_1} \to \mathbb{R}^{L_2 D_2 \times D_2}$ that maps the weights of the smaller network to the weights of the larger one, i.e., $\Theta^{(\mathrm{new})} = M(\Theta)$, where $L_1 < L_2$ and $D_1 < D_2$. After model growth, we adopt $\Theta^{(\mathrm{new})}$ as the initialization of the large model and train it using standard recipes.

² For notational brevity we assume that each hidden layer has the same number of dimensions $D$, but LiGO can be straightforwardly generalized to layers with different dimensions (e.g., FFN layers of transformers).
³ We therefore have $\mathrm{vec}(\Theta_{L,D}) \in \mathbb{R}^{LD^2}$. Our approach is also agnostic with regard to vectorization order.

3.1 EXISTING GROWTH OPERATORS

Existing works have separately established model growth operators for depth (L1 < L2 , D1 = D2 )
and width (L1 = L2 , D1 < D2 ). We summarize these methods below.
Depth expansion. StackBERT (Gong et al., 2019) proposes to duplicate the smaller model to double the depth, based on the observation that upper layers share similar functionality with the lower layers. In contrast, interpolation-based depth expansion methods (Chang et al., 2017; Dong et al., 2020) interleave every layer to form a deeper model, which can be roughly interpreted as simulating a finer-grained solution to the original dynamical system from a neural ODE perspective (Chen et al., 2018). Letting $L_2 = kL_1$, the two methods' growth operators can be formulated as:

(StackBERT) $W_l^{(\mathrm{new})} = W_{l \bmod L_1}$, \qquad (Interpolation) $W_l^{(\mathrm{new})} = W_{\lfloor l/k \rfloor}$, \qquad $\forall l \in [L_2]$.  (1)
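To make the two operators concrete, the following is a minimal NumPy sketch of Eq. 1 (the function name and toy shapes are ours, for illustration only): it builds the source-layer index map for stacking or interpolation and copies weights accordingly.

```python
import numpy as np

def grow_depth(weights, k, mode="stack"):
    """Duplicate the L1 layer weights of a small model into k*L1 layers (Eq. 1).

    weights: list of L1 arrays, each of shape (D, D).
    mode:    "stack" copies blocks of L1 layers on top of each other (StackBERT);
             "interpolate" repeats each layer k times in place.
    """
    L1, L2 = len(weights), k * len(weights)
    if mode == "stack":
        source = [l % L1 for l in range(L2)]       # W_l^(new) = W_{l mod L1}
    else:
        source = [l // k for l in range(L2)]       # W_l^(new) = W_{floor(l/k)}
    return [weights[s].copy() for s in source]

# Example: grow a 2-layer toy model to 4 layers.
small = [np.random.randn(4, 4) for _ in range(2)]
stacked = grow_depth(small, k=2, mode="stack")            # order: W1, W2, W1, W2
interleaved = grow_depth(small, k=2, mode="interpolate")  # order: W1, W1, W2, W2
```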
Width expansion. Net2Net (Chen et al., 2015) expands the width of neural networks by randomly
copying neurons while preserving output values via normalization. This can be seen as growing a
matrix associated with a particular layer by duplicating the columns and rows of its weight matrix.
Suppose a layer has weight matrix $W_l \in \mathbb{R}^{D_1 \times D_1}$.⁴ To expand it to a matrix $W_l^{(\mathrm{new})} \in \mathbb{R}^{D_2 \times D_2}$ ($D_2 > D_1$), Net2Net copies $W_l$ to the upper-left corner of $W_l^{(\mathrm{new})}$, fills the new columns via a random selection matrix $S_l$, and finally duplicates and normalizes rows according to the selection matrix from the previous layer. Formally, the growth operator of Net2Net can be written as:

(Net2Net) $W_l^{(\mathrm{new})} = \begin{bmatrix} I \\ S_{l-1}^\top \end{bmatrix} D_l^{-1} W_l \begin{bmatrix} I & S_l \end{bmatrix}, \quad D_l = \mathrm{diag}(S_{l-1}\mathbf{1}) + I, \quad \forall l \in [L_2]$  (2)

where $S_l \in \{0,1\}^{D_1 \times (D_2 - D_1)}$ is a random selection matrix. The diagonal of $D_l$ is a $D_1$-dimensional histogram whose $i$-th entry indicates the number of times the $i$-th column of $W_l$ was copied.

⁴ We define a single layer as $f_l(x) = W_l x + b_l$, where the row number of $W_l$ corresponds to the output dimension, and the column number of $W_l$ corresponds to the input dimension.
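The neuron-copying idea behind Eq. 2 can also be illustrated in the familiar two-layer form of Net2Net rather than the per-layer operator above; the sketch below is ours (function name, normalization placement, and shapes are illustrative), and the final check verifies the defining property of such operators, namely that the grown network computes the same function.

```python
import numpy as np

def net2wider(W1, W2, d_new, rng):
    """Net2Net-style width growth for a two-layer MLP y = W2 @ relu(W1 @ x).

    W1: (d_hidden, d_in), W2: (d_out, d_hidden). Grows the hidden size to d_new
    by duplicating randomly selected hidden units and rescaling fan-out weights.
    """
    d_hidden = W1.shape[0]
    sel = rng.integers(0, d_hidden, size=d_new - d_hidden)  # unit each new neuron copies
    counts = np.ones(d_hidden)
    for s in sel:
        counts[s] += 1                       # histogram: 1 + number of copies per unit

    W1_new = np.vstack([W1, W1[sel]])                         # duplicate rows
    W2_new = np.hstack([W2 / counts, (W2 / counts)[:, sel]])  # duplicate + normalize fan-out
    return W1_new, W2_new

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
W1n, W2n = net2wider(W1, W2, d_new=6, rng=rng)
x = rng.standard_normal(3)
assert np.allclose(W2 @ np.maximum(W1 @ x, 0), W2n @ np.maximum(W1n @ x, 0))  # same function
```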

3.2 LEARNING TO GROW WITH A STRUCTURED LINEAR GROWTH OPERATOR

While existing operators have been empirically successful in accelerating transformer-based models
such as BERT (Gong et al., 2019; Chen et al., 2021), we observe that they generally do not have a learning
component and perform the depth- and width-expansions separately. In this section we introduce
a general framework for learning to grow with a linear growth operator (LiGO), which generalizes
existing operators by combining the width- and depth-growth operators in a data-driven way.
We can formulate the problem of initializing the weights of the larger model $\Theta^{(\mathrm{new})}$ from the smaller model $\Theta$ through the following optimization problem,

$\arg\min_{M} \; \mathbb{E}_{x \sim \mathcal{D}}\, \mathcal{L}(x; \Theta^{(\mathrm{new})}), \quad \text{subject to } \Theta^{(\mathrm{new})} = M(\Theta),$  (3)

where $\mathcal{D}$ is the data distribution and $\mathcal{L}$ is the loss function. It is of course intractable to optimize over the entire operator space, and thus we further simplify the function $M$ to be a linear transformation, which results in the following formulation,

$\mathrm{vec}(\Theta^{(\mathrm{new})}) = \mathrm{vec}(M(\Theta)) = M\, \mathrm{vec}(\Theta), \qquad M \in \mathbb{R}^{L_2 D_2^2 \times L_1 D_1^2}.$  (4)

This simplified objective is still completely infeasible to apply to contemporary neural networks, where $L_1 D_1^2$ can easily be in the hundreds of millions. We therefore propose an efficient parameterization of $M$ for tractable learning.

3.2.1 DECOMPOSITION ALONG DEPTH AND WIDTH

Our first step is to decompose the LiGO operator as $M = L_{\mathrm{depth}} R_{\mathrm{width}}$, where $L_{\mathrm{depth}}$ and $R_{\mathrm{width}}$ expand the depth and width of the model separately. Concretely, we decompose $M$ as

$M = \underbrace{\begin{bmatrix} \mathrm{diag}(\ell_{1,1}) & \cdots & \mathrm{diag}(\ell_{1,L_1}) \\ \vdots & \ddots & \vdots \\ \mathrm{diag}(\ell_{L_2,1}) & \cdots & \mathrm{diag}(\ell_{L_2,L_1}) \end{bmatrix}}_{L_{\mathrm{depth}}} \underbrace{\begin{bmatrix} R_1 & & \\ & \ddots & \\ & & R_{L_1} \end{bmatrix}}_{R_{\mathrm{width}}},$  (5)

where $R_l \in \mathbb{R}^{D_2^2 \times D_1^2}$ and $\ell_{i,j} \in \mathbb{R}^{D_2^2}$. In the above, $L_{\mathrm{depth}}$ is an array of diagonal matrices and $R_{\mathrm{width}}$ is a block-diagonal matrix, i.e., both matrices are highly structured and sparse. When applying $R_{\mathrm{width}}$ to the weights $\mathrm{vec}(\Theta)$, the parameters of each layer are transformed independently via $\mathrm{vec}(W_l^{(\mathrm{new})}) = R_l\,\mathrm{vec}(W_l)$ and lifted to a higher dimension. The $l$-th row block of $L_{\mathrm{depth}}$ corresponds to the growth operator of the $l$-th layer, which amounts to linearly combining all layers of the smaller model via $\mathrm{vec}(W_l^{(\mathrm{new})})_k = \sum_{l'=1}^{L_1} (\ell_{l,l'})_k\, \mathrm{vec}(W_{l'})_k$. By this factorization, we can effectively reduce the complexity of the LiGO operator from $O(D_1^2 L_1 D_2^2 L_2)$ to $O(D_1^2 D_2^2 L_1)$ and encode architectural knowledge by grouping parameters by layers. Later, in Section 3.4, this representation is also shown to preserve high representation power owing to its connection with Monarch matrices (Dao et al., 2022; 2019).
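The factorization can be applied without ever materializing the full sparse matrix $M$; the sketch below (our own function and variable names, using NumPy's default vectorization order, which is fine since the approach is agnostic to it) applies the width blocks $R_l$ layer by layer and then mixes the lifted layers with the $\ell_{i,j}$ coefficients.

```python
import numpy as np

def apply_factorized_M(weights, R_blocks, ell, D2):
    """Apply M = L_depth @ R_width of Eq. 5 to vec(Theta), layer by layer.

    weights:  list of L1 arrays of shape (D1, D1) from the small model.
    R_blocks: list of L1 arrays of shape (D2*D2, D1*D1)  -- the blocks R_l.
    ell:      array of shape (L2, L1, D2*D2)             -- the diag(l_{i,j}) entries.
    """
    # Width step: each layer is transformed independently and lifted to D2 x D2.
    lifted = [R @ W.reshape(-1) for R, W in zip(R_blocks, weights)]
    # Depth step: each new layer is an elementwise linear combination of all lifted layers.
    return [sum(ell[i, j] * lifted[j] for j in range(len(lifted))).reshape(D2, D2)
            for i in range(ell.shape[0])]

# Toy usage: grow L1=2 layers of width D1=2 into L2=3 layers of width D2=3.
rng = np.random.default_rng(0)
small = [rng.standard_normal((2, 2)) for _ in range(2)]
R = [rng.standard_normal((9, 4)) for _ in range(2)]
ell = rng.standard_normal((3, 2, 9))
big = apply_factorized_M(small, R, ell, D2=3)   # three 3x3 matrices
```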

3.2.2 PARAMETER SHARING VIA KRONECKER FACTORIZATION

The above LiGO operator requires $O(D_1^2 D_2^2 L_1)$ parameters for $R_{\mathrm{width}}$ and $O(L_1 L_2 D_2^2)$ for $L_{\mathrm{depth}}$. The width operator $R_{\mathrm{width}}$ is thus still prohibitively expensive given that $D_1$ (and $D_2$) can easily be in the hundreds or thousands. In this section, we propose a Kronecker factorization to further reduce the number of learnable parameters for each growth operator.


Depth. For depth, we treat an entire layer as a single group and construct a new layer by combining existing layers, effectively tying the parameters of all neurons in the same layer. Formally, each block in $L_{\mathrm{depth}}$ is simplified to $\mathrm{diag}(\ell_{i,j}) = w_{i,j} I$. The entire matrix can then be written as a Kronecker factorization, $L_{\mathrm{depth}} = w \otimes I$, where $w \in \mathbb{R}^{L_2 \times L_1}$ is a matrix whose entry $w_{i,j}$ indicates the blending weight of the $j$-th layer of the small model in forming the $i$-th layer of the large model. This strategy reduces the number of parameters in $L_{\mathrm{depth}}$ to $O(L_1 L_2)$, and is shown on the left-hand side of Figure 1.
Width. For width, we decompose each diagonal block of the width expansion operator $R_{\mathrm{width}}$ using the Kronecker factorization $R_l = A_l \otimes B_l$, where $A_l, B_l \in \mathbb{R}^{D_2 \times D_1}$. Since $\mathrm{vec}(CAB) = (B^\top \otimes C)\,\mathrm{vec}(A)$ (Schacke, 2004), we then have

$R_{\mathrm{width}}\, \mathrm{vec}(\Theta) = \begin{bmatrix} A_1 \otimes B_1 & & \\ & \ddots & \\ & & A_{L_1} \otimes B_{L_1} \end{bmatrix} \mathrm{vec}(\Theta)$  (6)

$\qquad\qquad\qquad = \mathrm{vec}\!\left( \begin{bmatrix} B_1 W_1 A_1^\top & \cdots & B_{L_1} W_{L_1} A_{L_1}^\top \end{bmatrix} \right).$  (7)

Here we observe that $B_l W_l A_l^\top$ performs in- and out-dimension expansion via $A_l$ and $B_l$, respectively. Each new column/row is a linear combination of the columns/rows of the small model's weight matrix. This factorization, which can be seen as grouping parameters by neurons, reduces the number of parameters to $O(L_1 D_1 D_2)$. Figure 1 (right) illustrates LiGO's width-expansion operator.
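The Kronecker identity underlying Eqs. 6-7 can be verified numerically; the snippet below is a small sanity check (variable names are ours) that $(A_l \otimes B_l)\,\mathrm{vec}(W_l) = \mathrm{vec}(B_l W_l A_l^\top)$ under column-major vectorization.

```python
import numpy as np

rng = np.random.default_rng(0)
D1, D2 = 3, 5
A, B = rng.standard_normal((D2, D1)), rng.standard_normal((D2, D1))
W = rng.standard_normal((D1, D1))

vec = lambda M: M.reshape(-1, order="F")     # column-major vec(.)
lhs = np.kron(A, B) @ vec(W)                 # (A ⊗ B) vec(W): lifts D1*D1 -> D2*D2
rhs = vec(B @ W @ A.T)                       # vec(B W A^T)
assert np.allclose(lhs, rhs)
```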
Altogether, we obtain the final parameterization of the LiGO operator $M$:

$M = \underbrace{\left( \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,L_1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{L_2,1} & w_{L_2,2} & \cdots & w_{L_2,L_1} \end{bmatrix} \otimes I \right)}_{\text{Depth expansion}} \underbrace{\begin{bmatrix} A_1 \otimes B_1 & & \\ & \ddots & \\ & & A_{L_1} \otimes B_{L_1} \end{bmatrix}}_{\text{Width expansion}}$  (8)

We can exploit the factorization to implement the LiGO operator (Eq. 8) efficiently.
Training. LiGO expands a model in three steps: (1) for each layer, inserting new rows by linearly combining existing rows through $B_l$; (2) for each layer, inserting new columns by linearly combining existing columns through $A_l$; and finally (3) reconstructing each layer by linearly combining the weight matrices with $w$ along the depth. We then run a few steps (e.g., 100 iterations) of SGD to optimize $M$, which has negligible compute cost relative to regular training. After obtaining $M$, we initialize the large model with $M \mathrm{vec}(\Theta)$ and train the parameters $\Theta^{(\mathrm{new})}$ through SGD as usual. Algorithm 1 summarizes a forward pass of LiGO with a transformer. Finally, as shown in Appendix A, we note that StackBERT (Eq. 1), Interpolation (Eq. 1), and Net2Net (Eq. 2) are all special cases of LiGO (Eq. 8) with particular settings of $L_{\mathrm{depth}}$ and $R_{\mathrm{width}}$.
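As an illustration of the three steps above, the sketch below (all names are ours, and it covers only the simple multi-layer-perceptron case; transformer weights additionally need the tying scheme of Section 3.3) materializes the grown weights $W_i^{(\mathrm{new})} = \sum_j w_{i,j}\, B_j W_j A_j^\top$. Because the map is differentiable in $A_l$, $B_l$, and $w$, the ~100 SGD steps that learn $M$ simply backpropagate the large model's loss through this function.

```python
import torch

def ligo_expand(weights, A, B, w):
    """Materialize vec(Theta_new) = M vec(Theta) with M parameterized as in Eq. 8.

    weights: list of L1 frozen (D1, D1) tensors from the pretrained small model.
    A, B:    lists of L1 learnable (D2, D1) tensors  -- per-layer width expansion.
    w:       learnable (L2, L1) tensor               -- depth-mixing coefficients.
    Returns L2 tensors of shape (D2, D2) used to initialize the large model.
    """
    # Width step (Eqs. 6-7): expand columns via A_l and rows via B_l.
    lifted = [B_l @ W_l @ A_l.T for W_l, A_l, B_l in zip(weights, A, B)]
    # Depth step: every new layer is a linear combination of all width-expanded layers.
    return [sum(w[i, j] * lifted[j] for j in range(len(lifted))) for i in range(w.shape[0])]

# Toy shapes: grow a 2-layer, width-4 model to 4 layers, width 6.
D1, D2, L1, L2 = 4, 6, 2, 4
weights = [torch.randn(D1, D1) for _ in range(L1)]
A = [torch.randn(D2, D1, requires_grad=True) for _ in range(L1)]
B = [torch.randn(D2, D1, requires_grad=True) for _ in range(L1)]
w = torch.randn(L2, L1, requires_grad=True)
new_weights = ligo_expand(weights, A, B, w)   # differentiable w.r.t. A, B, and w
```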

3.3 LiGO FOR TRANSFORMERS

While LiGO can be applied to any multi-layer neural network architecture, in this paper we focus
on using LiGO to grow transformers which have been shown to be particularly amenable to scaling.
Below we briefly describe how LiGO is applied to the main transformer embedding/attention layers
and defer further details (e.g., growing bias vectors, layer norm parameters) to Appendix B.1.
Embedding layer. The embedding layer can be regarded as a linear layer whose inputs are one-hot
vectors. We learn a matrix B (emb) to extend its output dimension. This embedding layer is also
used as the final output layer for our transformer language modeling experiments.
Attention and feedforward layers. An attention layer consists of multi-head attention weights ($W^Q$, $W^K$, $W^V$) and a linear projection ($W^O$). Let $A_l^k$ and $B_l^k$, where $k \in \{Q, K, V, O\}$, be the $l$-th layer's in- and out-dimension expansion matrices (Eq. 6) for the query, key, value, and projection matrices. To make sure new input and output channels are aligned across modules, we tie the LiGO operator as follows: for all $l \in [L_1]$, (1) $A_l^k = (B^{(\mathrm{emb})})^\top$ for all $k \in \{Q, K, V\}$, (2) $A_l^O = (B_l^V)^\top$, (3) $B_l^O = B^{(\mathrm{emb})}$. The last constraint is added to take into account the residual connections (Chen et al., 2021). We similarly tie parameters for the feed-forward networks: $A_l^{(fc1)} = (B^{(\mathrm{emb})})^\top$, $A_l^{(fc2)} = (B_l^{(fc1)})^\top$, and $B_l^{(fc2)} = B^{(\mathrm{emb})}$. Since transformers make heavy use of residual layers with skip connections, we found that simply using the same $B^{(\mathrm{emb})}$ to parameterize $A_l^k$ and $B_l^k$ for many layers/modules worked well in practice. This reduces the number of learnable parameters even further and enables fast learning of $M$ on a small amount of data (100 gradient steps).
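The tying rules can be summarized as a sharing plan; the short sketch below (identifiers are ours and purely illustrative) records which free, learnable expansion matrix each in-/out-expansion of block $l$ reuses.

```python
def tying_plan(l):
    """Which expansion matrix each module of transformer block l reuses (Sec. 3.3).

    "emb" denotes the shared embedding out-expansion B^(emb); f"V_{l}" and f"fc1_{l}"
    denote the block's own learnable out-expansions for the value and first FFN layers.
    """
    return {
        ("A", "Q"): "emb", ("A", "K"): "emb", ("A", "V"): "emb",  # rule (1)
        ("A", "O"): f"V_{l}",                                     # rule (2)
        ("B", "O"): "emb",                                        # rule (3): residual path
        ("A", "fc1"): "emb", ("A", "fc2"): f"fc1_{l}",
        ("B", "fc2"): "emb",
        # B^Q_l, B^K_l, B^V_l, and B^(fc1)_l remain free, learnable matrices.
    }

print(tying_plan(0))
```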

3.4 CONNECTION TO MONARCH MATRICES

As shown in Section 3.2.1, our depth-width decomposition factorizes $M$ into a product of two structured sparse matrices. We examine the expressiveness of this factorized representation by relating it to Monarch matrices (Dao et al., 2022), defined below.

Definition 1. Let the space of Monarch matrices be $\mathcal{M} \subseteq \mathbb{R}^{mn_1 \times mn_2}$. Then a matrix $M \in \mathcal{M}$ if $M = P_1 L P_2^\top R = P_1\, \mathrm{diag}(L_1, \cdots, L_{n_1})\, P_2^\top\, \mathrm{diag}(R_1, \cdots, R_{n_2})$, where $L_i \in \mathbb{R}^{b_1 \times b_2}$ and $R_i \in \mathbb{R}^{b_3 \times b_4}$ are dense rectangular matrices with $n_1 b_2 = n_2 b_3$, $P_1$ is the permutation $\pi(i) = (i - b_1 \lfloor i/b_1 \rfloor - 1) n_1 + \lfloor i/b_1 \rfloor + 1$, and $P_2$ is the permutation $\pi(j) = (j - b_2 \lfloor j/b_2 \rfloor - 1) n_1 + \lfloor j/b_2 \rfloor + 1$.

It is clear that the block-diagonal matrix $R$ has the identical form to our width-growth operator $R_{\mathrm{width}}$. By applying the permutation matrices $P_1$ and $P_2$ to $L$, $L$ is transformed into exactly the same form as our depth-growth operator $L_{\mathrm{depth}}$ in Eq. 5. This implies that our depth-width decomposition coincides with a Monarch sparsification of dense matrices, which generalizes butterfly matrices (Dao et al., 2019) and enjoys rich expressivity properties (Dao et al., 2020; 2022).

4 EXPERIMENTS
We conduct experiments to answer three key research questions. Q1: To what extent can LiGO
improve the training efficiency (FLOPs and wall time) of transformers compared to training from
scratch and other growth operators? Q2: Can LiGO be universally effective across transformers
from different domains (e.g., language and vision) and sizes? Q3: Can models trained using LiGO
achieve similar performance compared to the baselines when transferred to downstream tasks?

4.1 EXPERIMENTAL SETUP

Datasets. We follow Tan & Bansal (2020) and use the English Wikipedia corpus⁵ for training
BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We use the public C4 (Raffel et al.,
2020) dataset for training GPT2 (Radford et al., 2019). We use ImageNet (Deng et al., 2009) for
training vision transformers. We use GLUE (Wang et al., 2018), SQuADv1.1 (Rajpurkar et al.,
2016), and SQuADv2.0 (Rajpurkar et al., 2018) for evaluating pretrained BERT models. We test
downstream performance of vision transformers (DeiT (Touvron et al., 2021a)) by performing trans-
fer learning on 5 downstream image classification tasks, including CIFAR10 (Krizhevsky et al.,
2009), CIFAR100 (Krizhevsky et al., 2009), Flowers102 (Nilsback & Zisserman, 2008), Stanford-
Cars (Krause et al., 2013), and ChestXRay8 (Wang et al., 2017).
Models. We experiment with growing the following language and vision transformers: (1)
BERT-Small→BERT-Base, BERT-Base→BERT-Large, BERT-Small→BERT-Large; (2) RoBERTa-
Small→RoBERTa-Base for RoBERTa; (3) GPT2-Base→GPT2-Medium, (4) DeiT-S→DeiT-B, and
(5) CaiT-XS→CaiT-S. BERT-Small has 6 layers with 512 hidden dimensions, while other named
models are their usual sizes. See Appendix B.2 for full details.
Baselines. We compare our approach with the following baselines: (1) a training-from-scratch baseline, where we train the larger transformer without using any smaller pretrained models; (2) progressive
training methods designed for growing depth in transformers (StackBERT (Gong et al., 2019) and
MSLT (Yang et al., 2020)); (3) bert2BERT (Chen et al., 2021) that extends Net2Net (Chen et al.,
2015) for width expansion and stacking for depth expansion; (4) KI (Qin et al., 2021) which uses
distillation for transferring knowledge from the smaller model to the larger model.
Implementation details. We always use 100 gradient steps to learn the LiGO for all models, which
is negligible in terms of FLOPs/wall time compared to full training after initialization. We train both
BERT and RoBERTa models for 400K steps with a warmup of 10K steps. We remove the next-
sentence prediction task (Liu et al., 2019) and use a fixed sequence length of 128 for pretraining
⁵ While the original BERT (Devlin et al., 2019) paper also uses the Toronto Book Corpus (Zhu et al., 2015), we do not include it here since it is no longer publicly available.


Figure 2: Results on BERT. (a-b) Validation log perplexity vs. FLOPs and wall time, respectively, for training BERT-Base by reusing BERT-Small. (c) Log perplexity vs. FLOPs when growing BERT-Small and BERT-Base to BERT-Large. The solid line indicates the final perplexity of the larger model trained from scratch, while the dotted line represents the performance of the smaller model trained from scratch. LiGO offers about 45% savings in FLOPs and 40% savings in wall time over training BERT-Base from scratch. Our approach is also flexible in reusing either BERT-Small or BERT-Base for accelerating BERT-Large training.

Table 1: Downstream transfer learning performance on GLUE and SQuAD. All of the results are based on BERT-Base models trained using the different baselines; GLUE columns report accuracy (Acc.). LiGO achieves similar or even better performance than the original training-from-scratch baseline on several downstream tasks, despite improving training efficiency.

Method | Savings (FLOPs) | Savings (Walltime) | SST-2 | MNLI | MRPC | CoLA | QNLI | QQP | STS-B | SQuADv1.1 (F1/EM) | SQuADv2.0 (F1/EM) | Avg. GLUE | Avg. SQuAD (F1/EM)
Scratch | – | – | 88.19 | 78.43 | 85.78 | 62.09 | 87.06 | 87.18 | 86.99 | 86.55 / 77.31 | 71.31 / 67.07 | 82.25 | 78.79 / 72.19
StackBERT | 34.1% | 33.3% | 88.99 | 79.72 | 85.29 | 59.09 | 87.28 | 89.17 | 86.97 | 86.50 / 77.42 | 71.32 / 67.41 | 82.36 | 78.91 / 72.41
MSLT | 34.9% | 30.0% | 88.53 | 78.10 | 82.60 | 64.76 | 83.58 | 88.54 | 85.89 | 86.07 / 76.73 | 70.68 / 67.17 | 81.72 | 78.47 / 71.95
KI | -5.7% | -13.9% | 88.65 | 78.83 | 83.50 | 64.86 | 86.25 | 88.96 | 87.09 | 84.93 / 76.29 | 71.09 / 67.41 | 82.59 | 78.01 / 71.85
bert2BERT | 29.0% | 25.1% | 88.30 | 80.05 | 85.54 | 61.73 | 88.16 | 86.18 | 87.00 | 86.24 / 77.09 | 71.52 / 66.85 | 82.42 | 78.88 / 71.97
LiGO | 44.7% | 40.7% | 88.42 | 79.29 | 84.31 | 62.09 | 88.07 | 88.81 | 87.00 | 86.28 / 77.45 | 71.24 / 67.17 | 82.57 | 78.76 / 72.31

both models. For BERT, we use a batch size of 256 and a learning rate of 2e−4 , while we use a
batch size of 1024 and a learning rate of 8e−4 for training RoBERTa models.
Following Shen et al. (2022), we train GPT2 models with a batch size of 384 and sequence length
of 1024. For vision transformers, we build our models based on DeiT (Touvron et al., 2021a) and
CaiT (Touvron et al., 2021b), and apply their default hyper-parameters for training on ImageNet
dataset. We train all our vision transformers for 300 epochs with a batch size of 1024. For transfer
learning with BERT/RoBERTa, we follow Tan & Bansal (2020) and train for 3 epochs with a learn-
ing rate of 1e−4 and a batch-size of 32 for all tasks in GLUE. On SQuAD v1.1 and SQuAD 2.0,
we fine-tune for 2 epochs with a learning rate of 5e−5 and a batch size of 12. We run both GLUE
and SQuAD evaluations three times with different random seeds and report the mean numbers. For
transfer learning experiments on DeiT, we finetune the pretrained models with 1000 epochs, batch
size 768, learning rate 0.01, and use the same data augmentation in training on ImageNet. We use
the same pretraining data and experimental settings for all the baselines (including our approach)
for a fair comparison. Note that we include the additional compute required for training LiGO in
all our tables and figures. However, since our LiGO is only trained for 100 steps, the influence on
visualization and quantitative saving percentages is negligible.

4.2 RESULTS AND ANALYSIS

BERT. Figure 2 shows the comparison between the different baselines for training BERT models.
As seen from Figure 2(a), LiGO saves 44.7% computational cost (FLOPs) of training BERT-Base
(12 layers, 768 dimensions) from scratch by reusing BERT-Small (6 layers, 512 dimensions). LiGO
offers 40.7% savings in wall time compared to training from scratch (Figure 2(b)). Among the com-
pared methods, StackBERT is the most competitive in terms of both FLOPs and wall time, although
LiGO obtains +10.6% and +7.2% improvements in FLOPs and wall time on top of StackBERT.
Similarly, LiGO significantly outperforms the recent bert2BERT method which saves about 30%
computational costs. We observe that KI does not provide any real savings in training as it requires
additional computation for knowledge distillation. Figure 2(c) shows that our LiGO approach is
flexible in growing either BERT-Small or BERT-Base for accelerating BERT-Large training. As ex-
pected, reusing BERT-Base instead of BERT-Small leads to more savings in FLOPs (45.2% vs 30.3%)
as BERT-Base contains more implicit knowledge in its parameters. Table 1 shows the per-task per-


Figure 3: Results on RoBERTa and GPT2. (a) Log perplexity vs. FLOPs and (b) vs. wall time for RoBERTa-Small→RoBERTa-Base; (c) log perplexity vs. FLOPs for GPT2-Base→GPT2-Medium. LiGO reduces FLOPs by 47.2% and 22.5% for RoBERTa-Base and GPT2-Medium, respectively, demonstrating its effectiveness across different training strategies and architectures.

formance of different BERT-Base models on both GLUE and SQuAD benchmarks, where we find
that BERT trained with LiGO achieves very similar performance compared to the baselines on both
benchmarks. Finally, in Table 5 of Appendix C.3, we show that growing BERT-Small to BERT-
Base with 100 steps of LiGO and then finetuning on GLUE tasks without additional pretraining
outperforms just directly finetuning BERT-Small.
RoBERTa and GPT2. Figure 3(a-b) shows the results on RoBERTa, whose training recipe uses
larger batch size and learning rate than BERT. LiGO similarly accelerates RoBERTa training, which
indicates that our method is robust to optimization hyperparameters. On GPT2, LiGO saves 22.5%
computation cost of training GPT2-Medium (345M parameters) by reusing GPT2-Base (117M pa-
rameters) (Figure 3(c)). These consistent improvements show that LiGO is effective for accelerating
transformer training across different model architectures and sizes.
Vision Transformers. Figure 4 shows that by growing from DeiT-S, LiGO can save 55.4% FLOPs and 52% GPU wall time to reach the same performance of 81% on ImageNet. Interestingly, the model initialized by our data-driven growth operator (with only 100 gradient steps of tuning) can already achieve 72% accuracy at the beginning of training, and leads to a final accuracy of 81.7% at the end of training. Compared to the next best method, bert2BERT, LiGO obtains more than 15% savings, which once again demonstrates the effectiveness of our approach in growing vision transformers as well. Table 2 shows that finetuning results on downstream tasks are on par with the model trained from scratch, showing that LiGO does not harm the model's generalization capabilities when transferred to downstream datasets. We also find similar savings in CaiT-XS→CaiT-S, where LiGO saves FLOPs by 52.6% and wall time by 46.1% over training CaiT-S from scratch on ImageNet (see Appendix C.2 for more details).

Figure 4: Results on DeiT. (a) Accuracy vs. FLOPs and (b) accuracy vs. wall time for training DeiT-B (DeiT-S→DeiT-B). LiGO saves FLOPs and wall time by more than 50% over training from scratch on ImageNet.
Table 2: Transfer learning performance of DeiT-B. A DeiT-B model trained using LiGO performs similarly to the original train-from-scratch baseline on all downstream tasks.

Method | Savings (FLOPs) | Savings (Walltime) | ImageNet | CIFAR10 | CIFAR100 | Flowers | Cars | ChestXRay8
Scratch | – | – | 81.10 | 99.09 | 90.76 | 97.79 | 92.06 | 55.81
StackBERT | 23.8% | 15.1% | 81.21 | 99.11 | 90.80 | 97.56 | 92.09 | 55.77
KI | -11.2% | -36.8% | 81.27 | 99.07 | 90.21 | 97.71 | 92.11 | 55.79
MSLT | 36.7% | 28.9% | 81.01 | 98.94 | 90.32 | 97.81 | 92.08 | 55.80
bert2BERT | 40.8% | 37.0% | 81.59 | 99.14 | 90.69 | 97.67 | 92.15 | 55.82
LiGO | 55.4% | 52.0% | 81.71 | 99.12 | 90.74 | 97.77 | 92.09 | 55.82

Combining with other training strategies. We also find that LiGO can be effectively combined with orthogonal strategies such as layer dropping (Zhang & He, 2020), token dropping (Hou et al., 2022), and staged training (Chen et al., 2021); more details are included in Appendix B.3. Figure 5 shows that LiGO can be combined with other training techniques to improve the computational savings by 4.7%, 7.4%, and 8.2% with layer dropping, token dropping, and staged training, respectively.


Figure 5: LiGO with other efficient training strategies. Our approach can be combined with (a) layer dropping, (b) token dropping, and (c) staged training (ST) to further accelerate BERT training.

Following Chen et al. (2021), we also apply the staged training strategy to bert2BERT and observe that LiGO still outperforms bert2BERT with staged training by 16.7% (see Figure 5(c)).
4.3 ABLATION STUDIES
Depth-only expansion. We examine the effectiveness of our proposed depth expansion operator (Ldepth) by only growing the depth of BERT from 6 layers to 12 layers, i.e., BERT(6, 768)→BERT(12, 768). We compare with stacking (StackBERT, Gong et al., 2019), interpolation (InterBERT, Chang et al., 2017; Dong et al., 2020) (see Eq. 1), and MSLT (Yang et al., 2020). For LiGO, we only apply its Ldepth component to the pretrained model weights. Results in Figure 6(a) show that a data-driven approach works well even when just growing across the depth dimension.

Figure 6: Results on depth-only and width-only growth. (a) BERT(6, 768)→BERT(12, 768); (b) BERT(12, 512)→BERT(12, 768). LiGO saves 51.7% FLOPs when expanding depth only, and 41.6% FLOPs when expanding width only.
Width-only expansion. We also verify the effectiveness of Rwidth by only extending the BERT width from 512 to 768, i.e., BERT(12, 512)→BERT(12, 768). We compare LiGO-based initialization with direct copy (Wei et al., 2016), function-preserving initialization (FPI, Chen et al., 2015), and advanced knowledge initialization (AKI, Chen et al., 2021). LiGO's width expansion component outperforms all other methods, as shown in Figure 6(b).

Table 3: Effect of the number of gradient steps. "+FLOPs" stands for the additional FLOPs (in 10^15).
# of Steps | +FLOPs | Savings
100 | 3.61 | 44.7%
500 | 18.06 | 44.5%
1000 | 36.13 | 44.2%
10000 | 361.30 | 38.9%
Number of growing steps. Our main experiments use just 100 gradient steps to grow. We tune our LiGO on the pretraining set for 100, 500, 1000, and 10000 steps and compute the additional FLOPs for BERT-Small→BERT-Base training. Table 3 shows that training LiGO for up to 1000 steps results in identical model convergence (reaching 1.8 PPL at 215K steps). This suggests that tuning model weights under the linear constraints of LiGO can achieve faster convergence. Training LiGO for 10000 steps can provide a model with slightly faster convergence (214K steps), but results in lower overall savings.

5 CONCLUSION
This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where the larger transformer's parameters are initialized via a linear mapping from the smaller pretrained model's parameters. The linear map is factorized as a composition of sparse width- and depth-expansion operators with a Kronecker factorization that groups parameters into layers and neurons. We demonstrate the effectiveness of our proposed approach on both language and vision transformers of different sizes, outperforming several competing methods. While our compute resources prevented us from applying LiGO to even larger transformers, it would be interesting to see if this approach can be applied on top of even larger models.


ACKNOWLEDGMENTS
PW sincerely thanks Zhen Wang for the insightful discussion and for providing reference reposi-
tories for language model pre-training. PW also appreciates Hao Tan’s assistance for reproducing
fine-tuning results on GLUE datasets. YK and LTH were partially supported by an MIT-IBM Watson
AI grant and an Amazon award. We also acknowledge support from the IBM Research AI Hard-
ware Center, and the Center for Computational Innovation at Rensselaer Polytechnic Institute for the
computational resources on the AiMOS Supercomputer. The research of ZW is in part supported by
the US Army Research Office Young Investigator Award (W911NF2010240).

REFERENCES
Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale
image recognition without normalization. In International Conference on Machine Learning, pp.
1059–1071, 2021.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. In Proceedings of NeurIPS, 2020.
Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by
network transformation. In Proceedings of AAAI, 2018.
Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual net-
works from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao
Chen, Zhiyuan Liu, and Qun Liu. bert2bert: Towards reusable pretrained language models. arXiv
preprint arXiv:2110.07143, 2021.
Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary
differential equations. Advances in neural information processing systems, 31, 2018.
Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge
transfer. arXiv preprint arXiv:1511.05641, 2015.
Justin Chiu, Yuntian Deng, and Alexander Rush. Low-rank constraints for fast inference in struc-
tured models. Advances in Neural Information Processing Systems, 34:2887–2898, 2021.
Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. Nest: A neural network synthesis tool based on a
grow-and-prune paradigm. IEEE Transactions on Computers, 68(10):1487–1497, 2019.
Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, and Christopher Ré. Learning fast algorithms for
linear transforms using butterfly factorizations. In International conference on machine learning,
pp. 1517–1527, 2019.
Tri Dao, Nimit S Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri
Rudra, and Christopher Ré. Kaleidoscope: An efficient, learnable representation for all structured
linear maps. arXiv preprint arXiv:2012.14966, 2020.
Tri Dao, Beidi Chen, Nimit S Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu,
Aniruddh Rao, Atri Rudra, and Christopher Ré. Monarch: Expressive structured matrices for
efficient and accurate training. In International Conference on Machine Learning, pp. 4690–
4721, 2022.
Yann N Dauphin and Samuel Schoenholz. Metainit: Initializing learning by learning to initialize.
Advances in Neural Information Processing Systems, 32, 2019.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi-
erarchical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pp. 248–255, 2009.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.


Chengyu Dong, Liyuan Liu, Zichao Li, and Jingbo Shang. Towards adaptive residual network
training: A neural-ode perspective. In International conference on machine learning, pp. 2616–
2626. PMLR, 2020.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszko-
reit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale. In Proceedings of ICLR, 2021.
Utku Evci, Max Vladymyrov, Thomas Unterthiner, Bart van Merriënboer, and Fabian Pe-
dregosa. Gradmax: Growing neural networks using gradient information. arXiv preprint
arXiv:2201.05125, 2022.
Scott Fahlman. The recurrent cascade-correlation architecture. In Advances in Neural Information
Processing Systems, 1990.
Scott Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In Advances in
Neural Information Processing Systems, 1989.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and
statistics, pp. 249–256, 2010.
Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. Efficient training of
bert by progressively stacking. In International conference on machine learning, pp. 2337–2346,
2019.
Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chen Chen, and Jiawei Han. On the transformer
growth for progressive bert training. arXiv preprint arXiv:2010.12562, 2020.
Steven Gutstein, Olac Fuentes, and Eric Freudenthal. Knowledge transfer in deep convolutional
neural nets. In Proceedings of International Journal on Artificial Intelligence Tools, 2008.
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks
with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, and Denny
Zhou. Token dropping for efficient bert pretraining. arXiv preprint arXiv:2203.13240, 2022.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, An-
drea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp.
In International Conference on Machine Learning, 2019.
Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. Improving transformer opti-
mization through better initialization. In International Conference on Machine Learning, pp.
4475–4483, 2020.
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong
Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural
networks using pipeline parallelism. Advances in neural information processing systems, 32,
2019.
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and
Ser-Nam Lim. Visual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models. arXiv preprint arXiv:2001.08361, 2020.
Yannic Kilcher, Gary Bécigneul, and Thomas Hofmann. Escaping flat areas via function-preserving
structural network modifications. 2018.
Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained
categorization. In Proceedings of the IEEE international conference on computer vision work-
shops, pp. 554–561, 2013.


Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. arXiv preprint arXiv:2104.08691, 2021.
Changlin Li, Bohan Zhuang, Guangrun Wang, Xiaodan Liang, Xiaojun Chang, and Yi Yang. Au-
tomated progressive learning for efficient training of vision transformers. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12486–12496, 2022.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.
Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422,
2015.
Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number
of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing,
pp. 722–729. IEEE, 2008.
Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapter-
fusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247,
2020.
Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu,
Peng Li, Maosong Sun, et al. Knowledge inheritance for pre-trained language models. arXiv
preprint arXiv:2105.13880, 2021.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language under-
standing by generative pre-training. 2018.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text
transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions
for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions
for squad. arXiv preprint arXiv:1806.03822, 2018.
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo,
Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function
emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the
National Academy of Sciences, 118(15), 2021. ISSN 0027-8424. doi: 10.1073/pnas.2016239118.
Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction
of the generalization error across scales. arXiv preprint arXiv:1909.12673, 2019.
Kathrin Schacke. On the kronecker product. Master’s thesis, University of Waterloo, 2004.
Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew Peters, and Iz Beltagy. Staged training
for transformer language models. arXiv preprint arXiv:2203.06211, 2022.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using model par-
allelism. arXiv preprint arXiv:1909.08053, 2019.
Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep
learning. Advances in Neural Information Processing Systems, 28, 2015.


Hao Tan and Mohit Bansal. Vokenization: Improving language understanding with contextualized,
visual-grounded supervision. arXiv preprint arXiv:2010.06775, 2020.
Shanshan Tang, Bo Li, and Haijun Yu. Chebnet: Efficient and stable constructions of deep
neural networks with rectified power units using chebyshev approximations. arXiv preprint
arXiv:1911.05467, 2019.
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and
Hervé Jégou. Training data-efficient image transformers & distillation through attention. In
International Conference on Machine Learning, pp. 10347–10357, 2021a.
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going
deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2021b.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Proceedings of NeurIPS,
2017.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue:
A multi-task benchmark and analysis platform for natural language understanding. In Proceedings
of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for
NLP, pp. 353–355, 2018.
Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Sum-
mers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised
classification and localization of common thorax diseases. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 2097–2106, 2017.
Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. Network morphism. In Maria Florina
Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on
Machine Learning, pp. 564–572, 2016.
Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural architectures.
Advances in neural information processing systems, 32, 2019.
Lemeng Wu, Dilin Wang, Peter Stone, and Qiang Liu. Firefly neural architecture descent: a general
approach for growing neural networks. Advances in neural information processing systems, 2021.
Cheng Yang, Shengnan Wang, Chao Yang, Yuechuan Li, Ru He, and Jingqiao Zhang. Progressively
stacking 2.0: A multi-stage layerwise training method for bert training speedup. arXiv preprint
arXiv:2011.13635, 2020.
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan
Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep
learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without
normalization. arXiv preprint arXiv:1901.09321, 2019.
Minjia Zhang and Yuxiong He. Accelerating training of transformer-based language models with
progressive layer dropping. Advances in Neural Information Processing Systems, 33:14011–
14023, 2020.
Xu Zhang, Felix X Yu, Ruiqi Guo, Sanjiv Kumar, Shengjin Wang, and Shi-Fu Chang. Fast orthogo-
nal projection based on kronecker product. In Proceedings of the IEEE International Conference
on Computer Vision, pp. 2929–2937, 2015.
Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, W Ronny Huang, and Tom Goldstein. Gradinit:
Learning to initialize neural networks for stable and efficient training. Advances in Neural Infor-
mation Processing Systems, 34:16410–16422, 2021.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and
Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching
movies and reading books. In Proceedings of the IEEE international conference on computer
vision, pp. 19–27, 2015.


A UNIVERSALITY OF THE LiGO OPERATOR
Proposition 1. StackBERT (Eq. 1), Interpolation (Eq. 1), and Net2Net (Eq. 2) are all special cases of the LiGO operator (Eq. 8).

Proof. We prove Proposition 1 by constructing parameters in Ldepth and Rwidth .

Stacking. Stacking-based methods (Gong et al., 2019; Yang et al., 2020) duplicate the entire stack of lower layers of the small model on top of itself to form the new layers (Eq. 1). Formally, this operation can be expressed by the following operator:

$M = \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \\ I & & \\ & \ddots & \\ & & I \\ & \vdots & \end{bmatrix}}_{L_{\mathrm{depth}}} \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \end{bmatrix}}_{R_{\mathrm{width}}},$  (9)

i.e., $R_{\mathrm{width}}$ is the identity and $L_{\mathrm{depth}}$ vertically repeats the block-diagonal identity so that layer $l$ of the large model copies layer $l \bmod L_1$ of the small model.

Interpolation. Interpolation-based methods (Chang et al., 2017; Dong et al., 2020) repeat each layer twice in place. We can construct the following matrix to achieve layer interpolation (Eq. 1):

$M = \underbrace{\begin{bmatrix} I & & \\ I & & \\ & I & \\ & I & \\ & & \ddots \end{bmatrix}}_{L_{\mathrm{depth}}} \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \end{bmatrix}}_{R_{\mathrm{width}}}.$  (10)

We remark that any rearrangement of layers to construct new layers (mathematically, a permutation of existing layers with replacement) can be constructed in a similar way.

Net2Net. As shown in Eq. 6, the Kronecker factorization of $R_l$ amounts to decomposing the general growth operator into in-dimension and out-dimension expansion. We can construct Net2Net (Chen et al., 2015)-based growth by simply letting

$L_{\mathrm{depth}} = I \in \mathbb{R}^{L_1 D_2^2 \times L_2 D_2^2}, \qquad R_{\mathrm{width}} = \begin{bmatrix} A_1 \otimes B_1 & & \\ & \ddots & \\ & & A_{L_1} \otimes B_{L_1} \end{bmatrix},$  (11)

$A_l = \begin{bmatrix} I \\ \widetilde{S}_{l-1} \end{bmatrix}, \qquad B_l = \begin{bmatrix} I \\ S_l \end{bmatrix},$  (12)

where $S_l \in \{0,1\}^{(D_2 - D_1) \times D_1}$ is a selection matrix that enlarges the out dimension, and $\widetilde{S}_{l-1} = S_{l-1}\,\mathrm{diag}(\mathbf{1}^\top S_{l-1})^{-1}$ copies the selection from $S_{l-1}$ with normalization to guarantee function preservation under expansion.
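As a small numerical illustration of why selection matrices sit inside LiGO's search space, the check below (our construction, omitting the normalization term $\widetilde{S}_{l-1}$) shows that 0/1 selection-style $A_l$ and $B_l$ make the width step $B_l W_l A_l^\top$ copy rows and columns of $W_l$, which is exactly the Net2Net-style duplication pattern.

```python
import numpy as np

D1, D2 = 3, 5
W = np.arange(9, dtype=float).reshape(D1, D1)

sel = np.array([0, 2])                           # existing units copied into the new slots
S = np.zeros((D2 - D1, D1))
S[np.arange(D2 - D1), sel] = 1.0                 # one-hot selection rows

B = np.vstack([np.eye(D1), S])                   # out-dimension duplication
A = np.vstack([np.eye(D1), S])                   # in-dimension duplication (no normalization)
W_new = B @ W @ A.T

assert np.allclose(W_new[:D1, :D1], W)           # original weights in the top-left corner
assert np.allclose(W_new[D1:, :D1], W[sel])      # new rows copy the selected rows
assert np.allclose(W_new[:D1, D1:], W[:, sel])   # new columns copy the selected columns
```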

B IMPLEMENTATION DETAILS
B.1 GROWING TRANSFORMERS WITH LiGO

The transformer architecture consists of an embedding layer, a stack of attention blocks, and an output layer. Each attention block consists of a Multi-Head Attention (MHA) module followed by a FeedForward Network (FFN), with a skip connection across both modules. Applying LiGO requires the following considerations:


Embedding layer. For both language and vision transformers, the embedding layer can be regarded as a linear layer, whose inputs are one-hot embeddings in language models. We use a learnable matrix $B^{(\mathrm{emb})}$ to extend its output dimension.

Multi-head attention blocks. An attention layer in a transformer consists of multi-head attention weights ($W^Q$, $W^K$, $W^V$) and a linear projection ($W^O$). Let $A_l^k$ and $B_l^k$ with $k \in \{Q, K, V, O\}$ be the in- and out-dimension expansion matrices (Eq. 6) for the query, key, value, and projection matrices in the $l$-th layer, respectively. Applying $B_l^k$ to $W^k$ ($k \in \{Q, K, V\}$) constructs new heads by a weighted summation of rows of all existing heads. To make sure the new input and output channels are aligned across modules, we tie our LiGO operator with the following scheme: (1) $A_l^k = (B^{(\mathrm{emb})})^\top$ for all $k \in \{Q, K, V\}$, (2) $A_l^O = (B_l^V)^\top$, (3) $B_l^O = B^{(\mathrm{emb})}$, for all $l \in [L_1]$. Both the bias and layer normalization parameters inherit the associated linear transformation's out-dimension expansion matrix to grow the width. For depth expansion, each module independently combines the same module from other layers (Eq. 8) with learnable coefficients $w$.

Feed-forward networks. Each attention block is followed by a two-layer FFN. Let A_l^k and B_l^k with k ∈ {fc1, fc2} be the in- and out-dimension expansion matrices (Eq. 6) for the first and second FFN layers in the l-th layer, respectively. We tie the parameters for the feed-forward networks as: A_l^(fc1) = (B^(emb))^⊤, A_l^(fc2) = (B_l^(fc1))^⊤, and B_l^(fc2) = B^(emb).

Output layer. For the output head, we have A^(out) = (B^(emb))^⊤, since the output dimension of the attention layers is always aligned with B^(emb) by our construction. The output layer does not need out-dimension expansion. Algorithm 1 summarizes LiGO for growing transformers.
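For concreteness, a minimal NumPy sketch of the tied width expansion described above is given below. The expansion matrices B are random here purely to illustrate the shapes and the tying pattern; in LiGO they are learnable parameters, and the toy dimensions are not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
D1, D2, E, C = 8, 12, 10, 5                  # toy: width 8 -> 12, embedding input E, output classes C

# width-expansion factors (learned in LiGO; random here for shape illustration only)
B_emb = rng.normal(size=(D2, D1))
B_q, B_k, B_v = (rng.normal(size=(D2, D1)) for _ in range(3))
B_fc1 = rng.normal(size=(4 * D2, 4 * D1))

# small-model weights: embedding, one attention block, one FFN, and the output head
W_emb = rng.normal(size=(D1, E))
W_q, W_k, W_v, W_o = (rng.normal(size=(D1, D1)) for _ in range(4))
W_fc1, W_fc2 = rng.normal(size=(4 * D1, D1)), rng.normal(size=(D1, 4 * D1))
W_out = rng.normal(size=(C, D1))

# tied width expansion (Appendix B.1 / the width loop of Algorithm 1)
Omega = {
    "emb": B_emb @ W_emb,
    "q":   B_q @ W_q @ B_emb.T,
    "k":   B_k @ W_k @ B_emb.T,
    "v":   B_v @ W_v @ B_emb.T,
    "o":   B_emb @ W_o @ B_v.T,
    "fc1": B_fc1 @ W_fc1 @ B_emb.T,
    "fc2": B_emb @ W_fc2 @ B_fc1.T,
    "out": W_out @ B_emb.T,                  # output head only needs in-dimension expansion
}
for name, M in Omega.items():
    print(name, M.shape)                     # all shapes match the D2-wide model
```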

B.2  MODEL CONFIGURATIONS

We summarize the settings of the different transformer models used in our experiments in Table 4. For BERT and RoBERTa, we reuse the codebase provided by Tan & Bansal (2020). For GPT2, we follow OpenAI's model configuration and use the pre-training code provided by Shen et al. (2022). For DeiT, we use the official codebase (Touvron et al., 2021a).

Table 4: Configuration of different transformers.

             BERT-Small  BERT-Base  RoBERTa-Small  RoBERTa-Base  GPT2-Base  GPT2-Medium  |               DeiT-S  DeiT-B
# layers     6           12         6              12            12         24           |  # layers     12      12
# hidden     512         768        512            768           768        1024         |  # hidden     384     768
# heads      8           12         8              12            12         16           |  # heads      6       12
# vocab      30522       30522      50265          50265         50257      50257        |  input res.   224     224
seq. length  128         128        128            128           1024       1024         |  patch size   16      16

B.3  ORTHOGONAL EFFICIENT TRAINING STRATEGIES

For layer dropping, we follow the same progressive dropping-rate schedule as Zhang & He (2020) and set the maximum dropping rate to 0.1 in order to recover the original performance. For token dropping, we randomly set aside 15% of the tokens in the middle layers. For staged training, only a sub-network is activated and trained during the first 50k steps; afterwards, we perform full-model training for 350k steps.
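As a rough illustration only, the snippet below sketches a progressive layer-dropping rate that ramps up to the 0.1 maximum used here. It is a simple linear-ramp stand-in, not the exact schedule of Zhang & He (2020), and the total step count is only an example.

```python
def drop_rate(step: int, total_steps: int = 400_000, max_rate: float = 0.1) -> float:
    """Linearly ramp the layer-dropping rate from 0 to max_rate.

    Illustrative stand-in only; the actual progressive schedule follows Zhang & He (2020).
    """
    return max_rate * min(step / total_steps, 1.0)

# e.g. during training: drop each layer independently with probability drop_rate(step)
print(drop_rate(0), drop_rate(200_000), drop_rate(400_000))   # 0.0 0.05 0.1
```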

C  ADDITIONAL EXPERIMENTS

C.1  REUSING SMALLER MODELS TRAINED FOR ONLY A FEW STEPS

LiGO focuses on utilizing the knowledge of smaller models that have already been pretrained and are readily available. In this section, we investigate how LiGO can leverage smaller existing models that are only trained for a few steps to accelerate the training of a larger model. We perform an experiment on BERT-Base by reusing a BERT-Small trained for only 50k steps instead of the full 220k steps used in our main experiments. Figure 7 shows that LiGO still achieves 35.2% savings in FLOPs and 30.2% savings in wall time over training BERT-Base from scratch.


Algorithm 1 A forward pass of LiGO with a transformer.

1: Input: A small transformer with hidden dimension D_1 and L_1 layers. Denote the embedding layer as W^(emb) ∈ R^{D_1×E}, the attention layers as W_l^Q, W_l^K, W_l^V, W_l^O ∈ R^{D_1×D_1}, the FFN layers as W_l^(fc1) ∈ R^{4D_1×D_1}, W_l^(fc2) ∈ R^{D_1×4D_1}, the LayerNorm layers as W_l^(ln1), W_l^(ln2) ∈ R^{D_1}, for all l ∈ [L_1], and the output head as W^(out) ∈ R^{C×D_1}.
2: Output: A large transformer with hidden dimension D_2 and L_2 layers. Denote its weight matrices as Ω with the same superscripts as the small model.
3: Ω^(emb) ← B^(emb) W^(emb)
4: for l = 1, ..., L_1 do   ▷ Width expansion
5:     Ω_l^Q ← B^(Q) W_l^Q (B^(emb))^⊤
6:     Ω_l^K ← B^(K) W_l^K (B^(emb))^⊤
7:     Ω_l^V ← B^(V) W_l^V (B^(emb))^⊤
8:     Ω_l^O ← B^(emb) W_l^O (B^(V))^⊤
9:     Ω_l^(ln1) ← B^(emb) W_l^(ln1)
10:    Ω_l^(fc1) ← B^(fc1) W_l^(fc1) (B^(emb))^⊤
11:    Ω_l^(fc2) ← B^(emb) W_l^(fc2) (B^(fc1))^⊤
12:    Ω_l^(ln2) ← B^(emb) W_l^(ln2)
13: end for
14: for l = 1, ..., L_2 do   ▷ Depth expansion
15:    Ω_l^Q ← Σ_{j=1}^{L_1} w_{l,j}^Q Ω_j^Q
16:    Ω_l^K ← Σ_{j=1}^{L_1} w_{l,j}^K Ω_j^K
17:    Ω_l^V ← Σ_{j=1}^{L_1} w_{l,j}^V Ω_j^V
18:    Ω_l^O ← Σ_{j=1}^{L_1} w_{l,j}^O Ω_j^O
19:    Ω_l^(ln1) ← Σ_{j=1}^{L_1} w_{l,j}^(ln1) Ω_j^(ln1)
20:    Ω_l^(fc1) ← Σ_{j=1}^{L_1} w_{l,j}^(fc1) Ω_j^(fc1)
21:    Ω_l^(fc2) ← Σ_{j=1}^{L_1} w_{l,j}^(fc2) Ω_j^(fc2)
22:    Ω_l^(ln2) ← Σ_{j=1}^{L_1} w_{l,j}^(ln2) Ω_j^(ln2)
23: end for
24: Ω^(out) ← W^(out) (B^(emb))^⊤
25: Train the transformer with parameters Ω.
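The depth-expansion loop of Algorithm 1 (lines 14–23) simply forms each new layer's weight as a learned combination of the width-expanded layers of the small model. A compact NumPy sketch for a single weight type is shown below; the softmax normalization of the coefficients is an assumption for illustration only, as the algorithm merely requires learnable coefficients w.

```python
import numpy as np

rng = np.random.default_rng(0)
L1, L2, D2 = 6, 12, 16                           # toy: grow depth 6 -> 12 at width D2

# width-expanded weights of the small model (e.g. the Omega_l^Q from the width loop)
omega_small = rng.normal(size=(L1, D2, D2))

# depth-expansion coefficients (learned in LiGO; softmax-normalized here as an assumption)
w = rng.normal(size=(L2, L1))
w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)

# each new layer l is sum_j w[l, j] * omega_small[j]   (Algorithm 1, lines 15-22)
omega_large = np.einsum("lj,jmn->lmn", w, omega_small)
print(omega_large.shape)                         # (12, 16, 16)
```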

C.2  RESULTS ON CAIT

In addition to DeiT (Touvron et al., 2021a), we perform experiments with CaiT (Touvron et al., 2021b) on ImageNet and find that, when reusing CaiT-XS, LiGO offers about 52.6% savings in FLOPs and 46.1% savings in wall time over training CaiT-S from scratch (see Figure 8).

C.3  TASK-SPECIFIC FINETUNING WITH LiGO INITIALIZATION WITHOUT FURTHER PRETRAINING

We perform additional experiments by directly finetuning BERT-Base initialized by LiGO (from BERT-Small) without any further pretraining. We observe in Table 5 that the LiGO-initialized model benefits downstream tasks compared to BERT-Small trained from scratch (1st row vs. 2nd row).

C.4  GLUE PERFORMANCE USING ADAPTERFUSION

LiGO is mainly proposed to improve the efficiency of the pre-training stage and hence is compatible with various finetuning schemes such as full-model finetuning, adapters (Houlsby et al., 2019; Pfeiffer et al., 2020), or prompt tuning (Lester et al., 2021; Jia et al., 2022) for adaptation to downstream tasks. We test BERT-Base models trained using different baselines with AdapterFusion (Pfeiffer et al., 2020) instead of full finetuning on the GLUE benchmark. Table 6 shows that LiGO also achieves


[Figure 7: two panels for BERT-Small→BERT-Base, (a) log perplexity vs. FLOPs (1e18) and (b) log perplexity vs. wall time (hrs), comparing Scratch, LiGO, and LiGO from 50k.]

Figure 7: Results on BERT-Base by reusing a BERT-Small trained for 50k steps. Instead of growing BERT-Base from a fully trained BERT-Small, we run LiGO on a BERT-Small trained for only 50k steps. LiGO offers about 35.2% savings in FLOPs and 30.2% savings in wall time over training BERT-Base from scratch.

[Figure 8: two panels for CaiT-XS→CaiT-S, (a) validation accuracy vs. FLOPs (1e18) and (b) validation accuracy vs. wall time (hrs), comparing Scratch and LiGO.]

Figure 8: Results on CaiT. (a) Accuracy vs. FLOPs and (b) accuracy vs. wall time for training CaiT-S. LiGO saves FLOPs by 52.6% and wall time by 46.1% over training from scratch on ImageNet.

Table 5: GLUE performance of different LiGO models. All of the results are based on BERT-Base models with BERT-Small as the base model for LiGO optimization.

Method                            SST-2   MNLI    MRPC    CoLA    QNLI    QQP     STS-B   Average
                                  (Acc.)  (Acc.)  (Acc.)  (Acc.)  (Acc.)  (Acc.)  (Acc.)  (Acc.)
BERT-Small (Scratch)              87.21   77.56   82.11   59.93   85.06   85.82   84.99   80.38
BERT-Base (LiGO Init)             88.15   77.62   82.53   60.70   85.79   86.65   85.83   81.04
BERT-Base (LiGO Init + Pretrain)  88.42   79.29   84.31   62.09   88.07   88.81   87.00   82.57
BERT-Base (Scratch)               88.19   78.43   85.78   62.09   87.06   87.18   86.99   82.25


Table 6: Downstream performance using AdapterFusion (Pfeiffer et al., 2020) on the GLUE benchmark. All of the results are based on BERT-Base models trained using different baselines.

Method      Savings   Savings     SST-2   MNLI    MRPC    CoLA    QNLI    QQP     STS-B   Average
            (FLOPs)   (Walltime)  (Acc.)  (Acc.)  (Acc.)  (Acc.)  (Acc.)  (Acc.)  (Acc.)  (Acc.)
Scratch     –         –           88.41   78.60   86.02   62.39   87.62   88.02   86.52   82.51
StackBERT   34.1%     33.3%       88.78   79.80   85.43   59.56   87.71   89.19   86.27   82.39
MSLT        34.9%     30.0%       88.41   78.35   83.15   63.97   86.19   88.20   86.42   82.10
KI          -5.7%     -13.9%      88.94   78.84   84.00   64.61   86.75   88.19   87.93   82.75
bert2BERT   29.0%     25.1%       88.47   80.53   85.50   62.33   88.57   86.72   87.10   82.75
LiGO        44.7%     40.5%       88.45   80.01   84.67   63.05   88.06   88.92   87.00   82.88

on-par performance with the model trained from scratch under adapter-based tuning, with 44.7% savings in FLOPs and 40.5% savings in wall time. This shows that LiGO does not harm the model's generalization capability when adapters are used as a parameter-efficient finetuning strategy for transferring a trained model to downstream datasets.

C.5  INITIAL RESULTS ON BILLION+ PARAMETER MODELS

Our extensive experiments on BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), DeiT (Touvron et al., 2021a), and CaiT (Touvron et al., 2021b) show that LiGO consistently improves transformer training efficiency over the traditional approach of training from scratch, across domains and model sizes. One interesting future direction of our work is scaling LiGO to very large models with more than 100B parameters, such as GPT3 (Brown et al., 2020). While we currently do not possess the compute resources for such an extremely large-scale study, we perform a preliminary experiment on GPT2-1.5B (Radford et al., 2019) by using GPT2-Medium as the initialization. We train for 15k steps on the C4 dataset (Raffel et al., 2020) and find that LiGO saves about 39% of the computational cost (FLOPs) of training GPT2-1.5B from scratch to reach the same log perplexity (3.3). We believe it is imperative to study the extent to which the benefits of LiGO persist at the scale at which modern large language models are trained, and we hope to cover this in future work.
