Learning To Grow Pretrained Models For Efficient Transformer Training
ABSTRACT
Scaling transformers has led to significant breakthroughs in many domains, lead-
ing to a paradigm in which larger versions of existing models are trained and
released on a periodic basis. New instances of such models are typically trained
completely from scratch, despite the fact that they are often just scaled-up ver-
sions of their smaller counterparts. How can we use the implicit knowledge in
the parameters of smaller, extant models to enable faster training of newer, larger
models? This paper describes an approach for accelerating transformer training
by learning to grow pretrained transformers, where we learn to linearly map the
parameters of the smaller model to initialize the larger model. For tractable learn-
ing, we factorize the linear transformation as a composition of (linear) width-
and depth-growth operators, and further employ a Kronecker factorization of
these growth operators to encode architectural knowledge. Extensive experiments
across both language and vision transformers demonstrate that our learned Lin-
ear Growth Operator (LiGO) can save up to 50% of the computational cost of training
from scratch, while consistently outperforming strong baselines that also
reuse smaller pretrained models to initialize larger models.1
1 INTRODUCTION
The transformer architecture (Vaswani et al., 2017) has emerged as a general purpose architecture
for modeling many structured domains (Devlin et al., 2019; Brown et al., 2020; Rives et al., 2021;
Dosovitskiy et al., 2021; Touvron et al., 2021a). Perhaps more so than other architectures, the
transformer empirically seems to have inductive biases that make it especially amenable to scaling
(Rosenfeld et al., 2019; Kaplan et al., 2020), which has led to a paradigm in which larger versions of
smaller, existing models are trained and released on a periodic basis (e.g., the GPT lineage of models
(Radford et al., 2018; 2019; Brown et al., 2020)). New instances of such models are typically trained
completely from scratch, despite the fact that they are often scaled-up versions of their smaller
counterparts. Given the compute required to train even the smaller models, we argue that training
each model from scratch is wasteful, and that prior knowledge implicit in the parameters of smaller
pretrained models should be leveraged to enable faster training of larger models.
One approach to this problem is through the lens of model growth, wherein a smaller model’s pre-
trained parameters are used to initialize a subset of the larger model’s parameters. While earlier
works generally froze the parameters initialized from the pretrained model and only trained the new
(randomly initialized) parameters (Fahlman & Lebiere, 1989; Fahlman, 1990; Gutstein et al., 2008),
subsequent work has shown that copying a subset of the pretrained parameters to initialize the new
parameters and then finetuning the entire network significantly accelerates training and sometimes
even leads to better performance (Chen et al., 2015). When applied to modern transformers, these
mechanisms roughly translate to a depth-expansion operator in which pretrained models are stacked
(or combined with identity layers) to initialize deeper transformers (Gong et al., 2019; Yang et al.,
2020), and a width-expansion operator in which the smaller model’s matrices are copied to initialize
the larger model’s matrices (e.g., in block-diagonal fashion) (Chen et al., 2021; Gu et al., 2020).
∗ Work done during an internship at MIT-IBM Watson AI Lab.
1 Project page: https://vita-group.github.io/LiGO/
Figure 1: Our linear growth operator (LiGO) accelerates training by using the weights of a smaller model Θ
to initialize the weights of the larger model Θ(new) . LiGO is parameterized as a sparse linear map M that
can be decomposed into width- and depth-expansion operators. The width-operator Rwidth and depth-operator
Ldepth are structured matrices obtained from Kronecker products of smaller matrices which encode architec-
tural knowledge by grouping parameters into layers and neurons. While we show the expansion operators for
simple multi-layer perceptrons for illustrative purposes, in practice we apply LiGO to enable faster training of
transformer networks. In our approach, we learn the growth matrix M with 100 steps of SGD, use this to
initialize the larger model, and then continue training as usual. Best viewed in color.
Noting the empirical effectiveness of such recipes, we observe that existing mechanisms generally
do not have a learning component (e.g., randomly copying over neurons for width-expansion or
stacking consecutive layers for depth-expansion). This paper instead proposes an efficient, data-
driven approach for learning to grow transformers. In particular, our approach frames the problem
of initializing the larger model’s parameters as learning a linear mapping from the smaller model’s
parameters, i.e., Θ(large) = M Θ(small) where Θ(small) and Θ(large) are the vectorized parame-
ters of the small/large models. Due to the high dimensionality of the parameters, this mapping is
completely intractable to learn without any restrictions on M . We thus factorize the linear mapping
to be a composition of sparse width- and depth-expansion operators, M = Ldepth Rwidth , where
both width and depth matrices are further factorized to be a Kronecker product of smaller matri-
ces that express architectural knowledge (e.g., through grouping parameters by layers and neurons).
We show that our growth operators can represent existing approaches such as layer-stacking and
neuron-copying as special cases. We find that with a small amount of learning on M (e.g., 100
gradient steps) to initialize the larger model, we can significantly accelerate training of both vision
and language transformers. Figure 1 illustrates our approach.
We apply our learned linear growth operator (LiGO) to popular families of models—BERT (Devlin
et al., 2019), RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), and ViT (Dosovitskiy et al.,
2021; Touvron et al., 2021a;b)—and find that LiGO can consistently improve transformer training
efficiency over the traditional way of training from scratch across domains and model sizes. For
instance, LiGO saves 44.7% and 22.5% FLOPs for training BERT-Base and GPT2-Medium from
scratch by reusing pretrained smaller models that are half as big. Similarly, for vision transformers,
when using DeiT-S (Touvron et al., 2021a) for initialization, LiGO yields 55% savings in FLOPs
with no performance drop on ImageNet (Deng et al., 2009). These FLOPs savings directly trans-
late to similar wall clock savings. We further find that models trained using LiGO achieve similar
performance to the trained-from-scratch baselines when transferred to downstream tasks.
2 RELATED WORK
Efficient training. Efficient training of transformers has been studied from multiple perspectives.
Some methods that are orthogonal to our work include mixed precision training (Shoeybi et al.,
2019), large batch optimization (You et al., 2019), distributed training (Huang et al., 2019), and
dropping layers (Zhang & He, 2020) or tokens (Hou et al., 2022). Knowledge inheritance (Qin
et al., 2021) explores knowledge distillation during pretraining to efficiently learn larger transform-
ers. Progressive training, which first trains a small transformer with few layers and then gradually
expands by stacking layers, has also been applied to accelerate transformer training (Gong et al.,
2019; Yang et al., 2020; Li et al., 2022; Shen et al., 2022). Net2Net (Chen et al., 2015) uses function-
preserving transformations to grow width by copying neurons and depth by using identity layers.
Recently, bert2BERT (Chen et al., 2021) extends Net2Net to transformers. In contrast to these ap-
proaches, our approach learns to (linearly) transform the parameters of a smaller model to initialize a
larger model. While there is a line of work on learning to grow neural networks in a data-driven way,
these methods are in general difficult to apply to modern-scale transformers since they (for example)
involve growing a single neuron at a time or employ expensive optimization/search procedures (Wei
et al., 2016; Cai et al., 2018; Wu et al., 2019; 2021; Evci et al., 2022).
Network initialization. Our work is also related to work on neural network initialization. Exist-
ing works include controlling the norm of the parameters (Mishkin & Matas, 2015; Kilcher et al.,
2018; Dai et al., 2019; Wu et al., 2019; Glorot & Bengio, 2010) or replacing the normalization lay-
ers (Brock et al., 2021; Zhang et al., 2019; Huang et al., 2020). MetaInit (Dauphin & Schoenholz,
2019) proposes an automatic method that optimizes the norms of weight tensors to minimize the
gradient quotient on minibatches of random Gaussian samples. GradInit (Zhu et al., 2021) learns to
initialize larger networks by adjusting norm of each layer. Our work focuses on using smaller pre-
trained transformers to better initialize larger transformers, which remains an understudied problem.
Structured matrices. Finally, our work is also related to structured matrices which are typically
used to replace dense weight matrices for reducing training and inference computation cost. Ex-
amples include sparse and low rank matrices (Chiu et al., 2021; Han et al., 2015), Chebyshev
matrices (Tang et al., 2019), Toeplitz matrices (Sindhwani et al., 2015), Kronecker-product ma-
trices (Zhang et al., 2015), and butterfly matrices (Dao et al., 2019). A unified framework to learn
a broad family of structured matrices is presented in Sindhwani et al. (2015). Dao et al. (2022)
propose Monarch matrices, which inherit the expressiveness of butterfly matrices and achieve rea-
sonable accuracy-efficiency tradeoffs in many applications. While our approach is inspired by these
works, we propose to grow pretrained models by learning structured sparse linear operators with
Kronecker factorization, which to our knowledge has not been explored in the literature.
3 PROPOSED APPROACH
Notation. We denote the parameters of a neural network with L layers and D dimensions as
Θ_{L,D} = [W_1 · · · W_L]^⊤ ∈ R^{LD×D}, where W_l ∈ R^{D×D} denotes the weights for the l-th layer.2
With slight abuse of notation, we denote the vectorization of Θ_{L,D} as vec(Θ_{L,D})^⊤ =
[vec(W_1)^⊤ · · · vec(W_L)^⊤].3 Our goal is to re-use the parameters Θ = Θ_{L_1,D_1} from a pretrained
smaller model to initialize a large model Θ^{(new)} = Θ_{L_2,D_2} through a model growth operator
M : R^{L_1 D_1 × D_1} → R^{L_2 D_2 × D_2} that maps the weights of the smaller network to the weights of the
larger one, i.e., Θ^{(new)} = M(Θ), where L_1 < L_2 and D_1 < D_2. After model growth, we adopt
Θ^{(new)} as the initialization of the large model and train it using standard recipes.
Existing works have separately established model growth operators for depth (L1 < L2 , D1 = D2 )
and width (L1 = L2 , D1 < D2 ). We summarize these methods below.
Depth expansion. StackBERT (Gong et al., 2019) proposes to duplicate the smaller model to double
the depth, based on the observation that upper layers share similar functionality with the lower
layers. In contrast, interpolation-based depth expansion methods (Chang et al., 2017; Dong et al.,
2020) interleave every layer to form a deeper model, which can be roughly interpreted as simulating
a finer-grained solution to the original dynamical system from a neural ODE perspective (Chen et al.,
2018). Letting L_2 = kL_1, the two methods' growth operators can be formulated as:

(StackBERT) \; W_l^{(\mathrm{new})} = W_{l \bmod L_1}, \qquad (Interpolation) \; W_l^{(\mathrm{new})} = W_{\lfloor l/k \rfloor}, \qquad \forall l \in [L_2].    (1)
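To make the two index maps concrete, here is a small NumPy sketch (toy layer sizes and random placeholder weights, not the authors' implementation) that builds the stacked and interpolated layer lists of Eq. 1 for k = 2:

```python
import numpy as np

# Toy depth growth for Eq. 1: L1 = 3 small-model layers grown to L2 = k * L1 = 6.
L1, k, D = 3, 2, 4
L2 = k * L1
W = [np.random.randn(D, D) for _ in range(L1)]   # placeholder small-model weights

# StackBERT: new layer l copies old layer (l mod L1), i.e., the whole stack is repeated.
W_stack = [W[l % L1] for l in range(L2)]

# Interpolation: every old layer is repeated k times in place.
W_interp = [W[l // k] for l in range(L2)]

print([l % L1 for l in range(L2)])    # [0, 1, 2, 0, 1, 2]
print([l // k for l in range(L2)])    # [0, 0, 1, 1, 2, 2]
```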
Width expansion. Net2Net (Chen et al., 2015) expands the width of neural networks by randomly
copying neurons while preserving output values via normalization. This can be seen as growing a
matrix associated with a particular layer by duplicating the columns and rows of its weight matrix.
2 For notational brevity we assume that each hidden layer has the same number of dimensions D, but LiGO can be straightforwardly generalized to layers with different dimensions (e.g., FFN layers of transformers).
3 We therefore have vec(Θ_{L,D})^⊤ ∈ R^{LD^2}. Our approach is also agnostic with regard to vectorization order.
4 We define a single layer as f_l(x) = W_l x + b_l, where the row number of W_l corresponds to the output dimension, and the column number of W_l corresponds to the input dimension.
Suppose a layer has weight matrix W_l ∈ R^{D_1×D_1}.4 To expand it to a matrix W_l^{(new)} ∈ R^{D_2×D_2}
(D_2 > D_1), Net2Net copies W_l to the upper-left corner of W_l^{(new)}, fills the new columns via a
random selection matrix S_l, and finally duplicates and normalizes rows according to the selection
matrix from the previous layer. Formally, the growth operator of Net2Net can be written as:

(Net2Net) \; W_l^{(\mathrm{new})} = \begin{bmatrix} I \\ S_{l-1}^\top \end{bmatrix} D_l^{-1} W_l \, [\, I \;\; S_l \,], \qquad D_l = \mathrm{diag}(S_{l-1}\mathbf{1}) + I, \quad \forall l \in [L_2]    (2)

where S_l ∈ {0,1}^{D_1×(D_2−D_1)} is a random selection matrix. The diagonal of D_l is a D_1-dimensional
histogram whose i-th entry indicates the number of times the i-th column of W_l was copied.
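A minimal NumPy sketch of the growth rule in Eq. 2 as reconstructed above (toy sizes and random selection matrices; it illustrates the copy-and-normalize structure and the resulting shapes, not an exact reproduction of the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
D1, D2 = 3, 5

def selection(D1, D2, rng):
    # Each of the D2 - D1 new units copies one randomly chosen existing unit.
    S = np.zeros((D1, D2 - D1))
    S[rng.integers(0, D1, size=D2 - D1), np.arange(D2 - D1)] = 1.0
    return S

W_l = rng.standard_normal((D1, D1))
S_prev, S_l = selection(D1, D2, rng), selection(D1, D2, rng)

# Eq. 2: new columns come from S_l, new rows from S_{l-1}, and rows are
# renormalized by how often each unit of the previous layer was copied.
D_mat = np.diag(S_prev @ np.ones(D2 - D1)) + np.eye(D1)
W_new = np.vstack([np.eye(D1), S_prev.T]) @ np.linalg.inv(D_mat) @ W_l @ np.hstack([np.eye(D1), S_l])
print(W_new.shape)   # (5, 5): the grown D2 x D2 weight matrix
```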
While existing operators have been empirically successful in accelerating transformer-based models
such as BERT (Gong et al., 2019; Chen et al., 2021), we observe that they generally do not have a learning
component and perform the depth- and width-expansions separately. In this section we introduce
a general framework for learning to grow with a linear growth operator (LiGO), which generalizes
existing operators by combining the width- and depth-growth operators in a data-driven way.
We can formulate the problem of initializing the weights of the larger model Θ(new) from the smaller
model Θ through the following optimization problem,
\arg\min_{M} \; \mathbb{E}_{x\sim\mathcal{D}} \, \mathcal{L}(x; \Theta^{(\mathrm{new})}), \quad \text{subject to} \quad \Theta^{(\mathrm{new})} = M(\Theta),    (3)
where D is the data distribution and L is the loss function. It is of course intractable to optimize over
the entire operator space, and thus we further simplify the function M to be a linear transformation,
which results in the following formulation,
\mathrm{vec}(\Theta^{(\mathrm{new})}) = \mathrm{vec}(M(\Theta)) = M \, \mathrm{vec}(\Theta), \qquad M \in \mathbb{R}^{L_2 D_2^2 \times L_1 D_1^2}.    (4)
This simplified objective is still completely infeasible to apply to contemporary neural networks,
where L_1 D_1^2 can easily be in the hundreds of millions. We therefore propose an efficient parameter-
ization of M for tractable learning.
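As a rough sense of scale (a back-of-the-envelope sketch under the simplified square-layer notation above, ignoring embeddings, rectangular FFN matrices, and biases), growing BERT-Small to BERT-Base already makes a dense M astronomically large:

```python
# Size of an unrestricted growth matrix M (Eq. 4) under the simplified
# notation Theta_{L,D} (D x D layers only) -- an illustrative estimate,
# not the exact parameter counts of BERT-Small / BERT-Base.
L1, D1 = 6, 512     # BERT-Small: 6 layers, hidden size 512
L2, D2 = 12, 768    # BERT-Base: 12 layers, hidden size 768

small = L1 * D1 ** 2        # ~1.6e6 entries in vec(Theta)
large = L2 * D2 ** 2        # ~7.1e6 entries in vec(Theta_new)
dense_M = small * large     # ~1.1e13 entries for a dense linear map
print(f"{small:.2e} -> {large:.2e}; dense M would need {dense_M:.2e} entries")
```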
Our first step is to decompose the LiGO operator as M = Ldepth Rwidth , where Ldepth and Rwidth
expand the depth and width of model separately. Concretely, we decompose M as
M = \underbrace{\begin{bmatrix} \mathrm{diag}(\ell_{1,1}) & \cdots & \mathrm{diag}(\ell_{1,L_1}) \\ \vdots & \ddots & \vdots \\ \mathrm{diag}(\ell_{L_2,1}) & \cdots & \mathrm{diag}(\ell_{L_2,L_1}) \end{bmatrix}}_{L_{\mathrm{depth}}} \underbrace{\begin{bmatrix} R_1 & & \\ & \ddots & \\ & & R_{L_1} \end{bmatrix}}_{R_{\mathrm{width}}}.    (5)
This decomposition effectively reduces the complexity of the LiGO operator from O(D_1^2 L_1 D_2^2 L_2) to O(D_1^2 D_2^2 L_1)
and encodes architectural knowledge by grouping parameters by layers. Later in Section 3.4, this
representation is also shown to preserve high representation power owing to its connection with
Monarch matrices (Dao et al., 2022; 2019).
The above LiGO operator requires O(D_1^2 D_2^2 L_1) parameters for R_width and O(L_1 L_2 D_2^2) for
L_depth. The width operator R_width is thus still prohibitively expensive given that D_1 (and D_2)
can easily be in the hundreds or thousands. In this section, we propose a Kronecker factorization to
further reduce the number of learnable parameters for each growth operator.
Depth. For depth, we treat an entire layer as a single group and construct a new layer by combining
existing layers, effectively tying the parameters of all neurons in the same layer. Formally, each block in
L_depth is simplified to diag(ℓ_{i,j}) = w_{i,j} I. The entire matrix can then be written as a Kronecker
factorization, L_depth = w ⊗ I, where w ∈ R^{L_2×L_1} is a matrix whose entry w_{i,j} indicates the blending
weight of the j-th layer of the small model in forming the i-th layer of the large model. This strategy reduces
the number of parameters in L_depth to O(L_1 L_2), and is shown on the left-hand side of Figure 1.
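The Kronecker structure is cheap to verify on toy sizes. Below is a small NumPy sketch (random placeholder weights; illustrative only) showing that L_depth = w ⊗ I simply blends whole layers:

```python
import numpy as np

# Depth operator as a Kronecker product: w[i, j] is the blending weight of
# small-model layer j when forming new layer i.
rng = np.random.default_rng(0)
L1, L2, D = 2, 4, 3
w = rng.standard_normal((L2, L1))
L_depth = np.kron(w, np.eye(D * D))          # acts on the stacked vec(W_1), ..., vec(W_L1)

W = [rng.standard_normal((D, D)) for _ in range(L1)]
theta = np.concatenate([Wl.reshape(-1) for Wl in W])   # any fixed vec order works here
theta_new = L_depth @ theta

# New layer 0 is the weighted sum of the small model's layers, as the factorization implies.
W_new_0 = theta_new[: D * D].reshape(D, D)
assert np.allclose(W_new_0, w[0, 0] * W[0] + w[0, 1] * W[1])
```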
Width. For width, we decompose each diagonal block of the width expansion operator R_width using
the Kronecker factorization R_l = A_l ⊗ B_l, where A_l, B_l ∈ R^{D_2×D_1}. Since vec(CAB) =
(B^⊤ ⊗ C) vec(A) (Schacke, 2004), we then have

R_{\mathrm{width}} \, \mathrm{vec}(\Theta) = \begin{bmatrix} A_1 \otimes B_1 & & \\ & \ddots & \\ & & A_{L_1} \otimes B_{L_1} \end{bmatrix} \mathrm{vec}(\Theta)    (6)
= \big[\, \mathrm{vec}(B_1 W_1 A_1^\top)^\top \; \cdots \; \mathrm{vec}(B_{L_1} W_{L_1} A_{L_1}^\top)^\top \,\big]^\top.    (7)

Here we observe that B_l W_l A_l^⊤ performs in- and out-dimension expansion by A_l and B_l, respec-
tively. Each new column/row is a linear combination of columns/rows of the small model's weight
matrix. This factorization, which can be seen as grouping parameters by neurons, reduces the num-
ber of parameters to O(L_1 D_1 D_2). Figure 1 (right) illustrates LiGO's width-expansion operator.
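The identity behind Eq. 6-7 is easy to check numerically. A hedged NumPy sketch (toy sizes, column-major vectorization, random matrices standing in for pretrained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
D1, D2 = 3, 5
A = rng.standard_normal((D2, D1))    # in-dimension (column) expansion
B = rng.standard_normal((D2, D1))    # out-dimension (row) expansion
W = rng.standard_normal((D1, D1))    # a small-model layer

vec = lambda X: X.flatten(order="F")            # column-major vec(.)
lhs = np.kron(A, B) @ vec(W)                    # (A ⊗ B) vec(W_l), as in Eq. 6
rhs = vec(B @ W @ A.T)                          # vec(B_l W_l A_l^T), as in Eq. 7
assert np.allclose(lhs, rhs)
print((B @ W @ A.T).shape)                      # (5, 5): each new row/column is a
                                                # linear combination of old ones
```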
Altogether, we obtain the final parameterization of the LiGO operator M:

M = \underbrace{\begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,L_1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{L_2,1} & w_{L_2,2} & \cdots & w_{L_2,L_1} \end{bmatrix} \otimes I}_{\text{Depth expansion}} \; \underbrace{\begin{bmatrix} A_1 \otimes B_1 & & \\ & \ddots & \\ & & A_{L_1} \otimes B_{L_1} \end{bmatrix}}_{\text{Width expansion}}.    (8)
We can exploit the factorization to implement the LiGO operator (Eq. 8) efficiently.
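For instance, rather than materializing M (which would again be enormous), one can apply the width and depth factors layer by layer. A NumPy sketch under toy sizes (not the authors' code; the dense M is built here only to sanity-check the factorized computation):

```python
import numpy as np

rng = np.random.default_rng(0)
L1, L2, D1, D2 = 2, 4, 3, 5
W = [rng.standard_normal((D1, D1)) for _ in range(L1)]   # small model's layers
w = rng.standard_normal((L2, L1))                        # depth blending weights
A = [rng.standard_normal((D2, D1)) for _ in range(L1)]   # in-dim expansions
B = [rng.standard_normal((D2, D1)) for _ in range(L1)]   # out-dim expansions

# Factorized application of Eq. 8: width step per layer, then a depth blend.
grown = [B[j] @ W[j] @ A[j].T for j in range(L1)]
W_new = [sum(w[i, j] * grown[j] for j in range(L1)) for i in range(L2)]

# Sanity check against the explicit (L2*D2^2) x (L1*D1^2) matrix M on toy sizes.
vec = lambda X: X.flatten(order="F")
R_width = np.zeros((L1 * D2 * D2, L1 * D1 * D1))
for j in range(L1):
    R_width[j*D2*D2:(j+1)*D2*D2, j*D1*D1:(j+1)*D1*D1] = np.kron(A[j], B[j])
M = np.kron(w, np.eye(D2 * D2)) @ R_width
theta_new = M @ np.concatenate([vec(Wl) for Wl in W])
assert np.allclose(theta_new[:D2*D2], vec(W_new[0]))
```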
Training. LiGO expands a model in three steps: (1) for each layer, inserting new rows by lin-
early combining existing rows through B l , (2) for each layer, inserting new columns by linearly
combining existing columns through Al , and then finally (3) reconstructing each layer by linearly
combining the weight matrices with w along the depth. We then run a few steps (e.g., 100 itera-
tions) of SGD to optimize M , which has negligible compute cost relative to regular training. After
obtaining M, we initialize the large model with M vec(Θ), and train the parameters Θ^(new) through SGD
as usual. Algorithm 1 summarizes a forward pass of LiGO with a transformer. Finally, as shown in
Appendix A we note that StackBERT (Eq. 1), Interpolation (Eq. 1), and Net2Net (Eq. 2) are all
special cases of LiGO (Eq. 8) with a particular setting of Ldepth and Rwidth .
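To make the training recipe concrete, here is a self-contained PyTorch sketch of learning the LiGO parameters (w, A_l, B_l) on the toy multi-layer perceptron of Figure 1, with random placeholder weights and a stand-in minibatch; the actual pipeline applies the same idea to a pretrained transformer and its training corpus, with the tying rules described below:

```python
import torch

torch.manual_seed(0)
D1, D2, L1, L2, C = 8, 16, 2, 4, 4          # toy widths/depths; C = number of classes

# Placeholder "pretrained" small-model weights (kept fixed while M is learned).
W_small = [torch.randn(D1, D1) for _ in range(L1)]

# LiGO parameters: depth blend w and per-layer in/out expansions A_l, B_l.
w = torch.randn(L2, L1, requires_grad=True)
A = [torch.randn(D2, D1, requires_grad=True) for _ in range(L1)]
B = [torch.randn(D2, D1, requires_grad=True) for _ in range(L1)]
head = torch.randn(C, D2, requires_grad=True)        # hypothetical output head

def grown_weights():
    wide = [B[j] @ W_small[j] @ A[j].T for j in range(L1)]                 # width step
    return [sum(w[i, j] * wide[j] for j in range(L1)) for i in range(L2)]  # depth step

opt = torch.optim.SGD([w, head] + A + B, lr=1e-2)
x, y = torch.randn(32, D2), torch.randint(0, C, (32,))   # stand-in data batch
for _ in range(100):                                     # ~100 steps, as in the paper
    h = x
    for Wl in grown_weights():
        h = torch.relu(h @ Wl.T)                         # biases omitted for brevity
    loss = torch.nn.functional.cross_entropy(h @ head.T, y)
    opt.zero_grad(); loss.backward(); opt.step()

# Materialize Theta_new from the learned map and continue standard training from it.
init_weights = [Wl.detach() for Wl in grown_weights()]
```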
While LiGO can be applied to any multi-layer neural network architecture, in this paper we focus
on using LiGO to grow transformers which have been shown to be particularly amenable to scaling.
Below we briefly describe how LiGO is applied to the main transformer embedding/attention layers
and defer further details (e.g., growing bias vectors, layer norm parameters) to Appendix B.1.
Embedding layer. The embedding layer can be regarded as a linear layer whose inputs are one-hot
vectors. We learn a matrix B^(emb) to extend its output dimension. This embedding layer is also
used as the final output layer for our transformer language modeling experiments.
Attention and feedforward layers. An attention layer consists of multi-head attention weights
(W^Q, W^K, W^V) and a linear projection (W^O). Let A_l^k and B_l^k, where k ∈ {Q, K, V, O}, be the l-th
layer's in- and out-dimension expansion matrices (Eq. 6) for the query, key, value, and projection
matrices. To make sure new input and output channels are aligned across modules, we tie the LiGO
operator as follows: for all l ∈ [L_1], (1) A_l^k = (B^(emb))^⊤ for all k ∈ {Q, K, V}, (2) A_l^O = (B_l^V)^⊤,
and (3) B_l^O = B^(emb). The last constraint is added to take into account the residual connections (Chen
et al., 2021). We similarly tie parameters for the feed-forward networks: A_l^(fc1) = (B^(emb))^⊤,
A_l^(fc2) = (B_l^(fc1))^⊤, and B_l^(fc2) = B^(emb). Since transformers make heavy use of residual layers
with skip connections, we found that simply using the same B^(emb) to parameterize A_l^k and B_l^k
for many layers/modules worked well in practice. This reduces the number of learnable parameters
even further and enables fast learning of M on a small amount of data (100 gradient steps).
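A shape-only sketch of these tying rules for one block (toy dimensions; B^Q and B^K are left free here, though as noted above sharing B^(emb) for them works well too; this is an illustration, not the exact bookkeeping in the released code):

```python
import numpy as np

D1, D2 = 4, 8                                   # toy small/large hidden sizes
rng = np.random.default_rng(0)

B_emb = rng.standard_normal((D2, D1))           # shared embedding/out-dim expansion
B_v   = rng.standard_normal((D2, D1))           # out-dim expansion of W^V
B_fc1 = rng.standard_normal((D2, D1))           # out-dim expansion of the first FFN layer

A = {k: B_emb.T for k in ("Q", "K", "V")}       # (1) Q/K/V read from the grown residual stream
A["O"] = B_v.T                                  # (2) W^O's input follows W^V's output expansion
B = {"Q": rng.standard_normal((D2, D1)),
     "K": rng.standard_normal((D2, D1)),
     "V": B_v,
     "O": B_emb}                                # (3) W^O writes back into the residual stream
A["fc1"], B["fc1"] = B_emb.T, B_fc1             # FFN ties mirror the attention ties
A["fc2"], B["fc2"] = B_fc1.T, B_emb
```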
It is clear that the block-diagonal matrix R has the same form as our width-growth operator
R_width. By applying the permutation matrices P_1 and P_2 to L, L is transformed into exactly
the same form as our depth-growth operator L_depth in Eq. 5. This implies that our depth-width
decomposition coincides with the Monarch sparsification of dense matrices, which generalizes butterfly
matrices (Dao et al., 2019) and enjoys rich expressivity properties (Dao et al., 2020; 2022).
4 EXPERIMENTS
We conduct experiments to answer three key research questions. Q1: To what extent can LiGO
improve the training efficiency (FLOPs and wall time) of transformers compared to training from
scratch and other growth operators? Q2: Can LiGO be universally effective across transformers
from different domains (e.g., language and vision) and sizes? Q3: Can models trained using LiGO
achieve similar performance compared to the baselines when transferred to downstream tasks?
Datasets. We follow Tan & Bansal (2020) and use the English Wikipedia corpus5 for training
BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We use the public C4 (Raffel et al.,
2020) dataset for training GPT2 (Radford et al., 2019). We use ImageNet (Deng et al., 2009) for
training vision transformers. We use GLUE (Wang et al., 2018), SQuADv1.1 (Rajpurkar et al.,
2016), and SQuADv2.0 (Rajpurkar et al., 2018) for evaluating pretrained BERT models. We test
downstream performance of vision transformers (DeiT (Touvron et al., 2021a)) by performing trans-
fer learning on 5 downstream image classification tasks, including CIFAR10 (Krizhevsky et al.,
2009), CIFAR100 (Krizhevsky et al., 2009), Flowers102 (Nilsback & Zisserman, 2008), Stanford-
Cars (Krause et al., 2013), and ChestXRay8 (Wang et al., 2017).
Models. We experiment with growing the following language and vision transformers: (1)
BERT-Small→BERT-Base, BERT-Base→BERT-Large, BERT-Small→BERT-Large; (2) RoBERTa-
Small→RoBERTa-Base for RoBERTa; (3) GPT2-Base→GPT2-Medium, (4) DeiT-S→DeiT-B, and
(5) CaiT-XS→CaiT-S. BERT-Small has 6 layers with 512 hidden dimensions, while other named
models are their usual sizes. See Appendix B.2 for full details.
Baselines. We compare our approach with the following baselines: (1) a training-from-scratch baseline
where we train the larger transformer without using any smaller pretrained models; (2) progressive
training methods designed for growing depth in transformers (StackBERT (Gong et al., 2019) and
MSLT (Yang et al., 2020)); (3) bert2BERT (Chen et al., 2021) that extends Net2Net (Chen et al.,
2015) for width expansion and stacking for depth expansion; (4) KI (Qin et al., 2021) which uses
distillation for transferring knowledge from the smaller model to the larger model.
Implementation details. We always use 100 gradient steps to learn the LiGO operator for all models, which
is negligible in terms of FLOPs/wall time compared to full training after initialization. We train both
BERT and RoBERTa models for 400K steps with a warmup of 10K steps. We remove the next-
sentence prediction task (Liu et al., 2019) and use a fixed sequence length of 128 for pretraining
5 While the original BERT (Devlin et al., 2019) paper also uses the Toronto Book Corpus (Zhu et al., 2015), we do not include it here since it is no longer publicly available.
(a) BERT-Small→BERT-Base (b) BERT-Small→BERT-Base (c) BERT-{Small, Base}→BERT-Large
Figure 2: Results on BERT. (a-b) shows validation log perplexity vs. FLOPs and wall time respectively
for training BERT-Base by reusing BERT-Small. (c) shows log perplexity vs. FLOPs in growing BERT-
Small and BERT-Base to BERT-Large. The solid line indicates the final perplexity of the larger model trained
from scratch, while the dotted line represents performance of the smaller model trained from scratch. LiGO
offers about 45% savings in FLOPs and 40% savings in wall time over BERT-Base training from scratch. Our
approach is also flexible in reusing either BERT-Small or BERT-Base for accelerating BERT-Large training.
Table 1: Downstream transfer learning performance on GLUE and SQuAD. All of the results are based on
BERT-Base models trained using the different baselines. LiGO achieves similar or even better performance than
the original training from scratch baseline on several downstream tasks, despite improving training efficiency.
Method  Savings (FLOPs)  Savings (Walltime)  SST-2 (Acc.)  MNLI (Acc.)  MRPC (Acc.)  CoLA (Acc.)  QNLI (Acc.)  QQP (Acc.)  STS-B (Acc.)  SQuADv1.1 (F1/EM)  SQuADv2.0 (F1/EM)  Avg. GLUE  Avg. SQuAD
Scratch – – 88.19 78.43 85.78 62.09 87.06 87.18 86.99 86.55 / 77.31 71.31 / 67.07 82.25 78.79 / 72.19
StackBERT 34.1% 33.3% 88.99 79.72 85.29 59.09 87.28 89.17 86.97 86.50 / 77.42 71.32 / 67.41 82.36 78.91 / 72.41
MSLT 34.9% 30.0% 88.53 78.10 82.60 64.76 83.58 88.54 85.89 86.07 / 76.73 70.68 / 67.17 81.72 78.47 / 71.95
KI -5.7% -13.9% 88.65 78.83 83.50 64.86 86.25 88.96 87.09 84.93 / 76.29 71.09 / 67.41 82.59 78.01 / 71.85
bert2BERT 29.0% 25.1% 88.30 80.05 85.54 61.73 88.16 86.18 87.00 86.24 / 77.09 71.52 / 66.85 82.42 78.88 / 71.97
LiGO 44.7% 40.7% 88.42 79.29 84.31 62.09 88.07 88.81 87.00 86.28 / 77.45 71.24 / 67.17 82.57 78.76 / 72.31
both models. For BERT, we use a batch size of 256 and a learning rate of 2e−4 , while we use a
batch size of 1024 and a learning rate of 8e−4 for training RoBERTa models.
Following Shen et al. (2022), we train GPT2 models with a batch size of 384 and sequence length
of 1024. For vision transformers, we build our models based on DeiT (Touvron et al., 2021a) and
CaiT (Touvron et al., 2021b), and apply their default hyper-parameters for training on ImageNet
dataset. We train all our vision transformers for 300 epochs with a batch size of 1024. For transfer
learning with BERT/RoBERTa, we follow Tan & Bansal (2020) and train for 3 epochs with a learn-
ing rate of 1e−4 and a batch-size of 32 for all tasks in GLUE. On SQuAD v1.1 and SQuAD 2.0,
we fine-tune for 2 epochs with a learning rate of 5e−5 and a batch size of 12. We run both GLUE
and SQuAD evaluations three times with different random seeds and report the mean numbers. For
transfer learning experiments on DeiT, we finetune the pretrained models for 1000 epochs with a batch
size of 768 and a learning rate of 0.01, and use the same data augmentation as in ImageNet training. We use
the same pretraining data and experimental settings for all the baselines (including our approach)
for a fair comparison. Note that we include the additional compute required for training LiGO in
all our tables and figures. However, since our LiGO is only trained for 100 steps, the influence on
visualization and quantitative saving percentages is negligible.
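For quick reference, the pretraining hyper-parameters stated above can be summarized as follows (an illustrative configuration sketch compiled from this section; the dictionary names are ours, not the authors' config files):

```python
# Pretraining settings described in this section (illustrative summary only).
PRETRAIN = {
    "bert":    {"steps": 400_000, "warmup": 10_000, "batch_size": 256,  "lr": 2e-4, "seq_len": 128},
    "roberta": {"steps": 400_000, "warmup": 10_000, "batch_size": 1024, "lr": 8e-4, "seq_len": 128},
    "gpt2":    {"batch_size": 384, "seq_len": 1024},            # following Shen et al. (2022)
    "vit":     {"epochs": 300, "batch_size": 1024},             # DeiT/CaiT defaults on ImageNet
}
LIGO_STEPS = 100   # gradient steps used to learn the growth operator M before full training
```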
BERT. Figure 2 shows the comparison between the different baselines for training BERT models.
As seen from Figure 2(a), LiGO saves 44.7% computational cost (FLOPs) of training BERT-Base
(12 layers, 768 dimensions) from scratch by reusing BERT-Small (6 layers, 512 dimensions). LiGO
offers 40.7% savings in wall time compared to training from scratch (Figure 2(b)). Among the com-
pared methods, StackBERT is the most competitive in terms of both FLOPs and wall time, although
LiGO obtains +10.6% and +7.2% improvements in FLOPs and wall time on top of StackBERT.
Similarly, LiGO significantly outperforms the recent bert2BERT method which saves about 30%
computational costs. We observe that KI does not provide any real savings in training as it requires
additional computation for knowledge distillation. Figure 2(c) shows that our LiGO approach is
flexible in growing either BERT-Small or BERT-Base for accelerating BERT-Large training. As ex-
pected, reusing BERT-Base instead of BERT-Small leads to more savings in FLOPs (45.2% vs 30.3%)
as BERT-Base contains more implicit knowledge in its parameters.
(a) RoBERTa-Small→RoBERTa-Base (b) RoBERTa-Small→RoBERTa-Base (c) GPT2-Base→GPT2-Medium
Figure 3: Results on RoBERTa and GPT2. LiGO reduces FLOPs by 47.2% and 22.5% for RoBERTa-Base
and GPT2-Medium, respectively, demonstrating its effectiveness across different training strategies and architectures.
Table 1 shows the per-task performance of different BERT-Base models on both GLUE and SQuAD benchmarks, where we find
that BERT trained with LiGO achieves very similar performance compared to the baselines on both
benchmarks. Finally, in Table 5 of Appendix C.3, we show that growing BERT-Small to BERT-
Base with 100 steps of LiGO and then finetuning on GLUE tasks without additional pretraining
outperforms just directly finetuning BERT-Small.
RoBERTa and GPT2. Figure 3(a-b) shows the results on RoBERTa, whose training recipe uses a
larger batch size and learning rate than BERT's. LiGO similarly accelerates RoBERTa training, which
indicates that our method is robust to optimization hyperparameters. On GPT2, LiGO saves 22.5%
computation cost of training GPT2-Medium (345M parameters) by reusing GPT2-Base (117M pa-
rameters) (Figure 3(c)). These consistent improvements show that LiGO is effective for accelerating
transformer training across different model architectures and sizes.
Vision Transformers. Figure 4 shows that by growing from DeiT-S, LiGO can save 55.4% FLOPs and 52% GPU wall time to reach the same accuracy as training DeiT-B from scratch on ImageNet.
and 8.2% with layer dropping, token dropping, and staged training, respectively. Following Chen
et al. (2021), we also apply the staged training strategy to bert2BERT and observe that LiGO still out-
performs bert2BERT with staged training by 16.7% (see Figure 5(c)).
4.3 ABLATION STUDIES

Depth-only expansion. We examine the effectiveness of our proposed depth expansion operator (L_depth) by only growing the depth of BERT from 6 layers to 12 layers.
5 CONCLUSION
This paper describes an approach for accelerating transformer training by learning to grow pre-
trained transformers, where the larger transformer’s parameters are initialized as a linear mapping
from the smaller pretrained model's parameters. The linear map is factorized as a composition of
sparse width- and depth-expansion operators with a Kronecker factorization that groups parameters
into layers and neurons. We demonstrate the effectiveness of our proposed approach on both lan-
guage and vision transformers of different sizes, outperforming several competing methods. While
our compute resources prevented us from applying LiGO to even larger transformers, it would be
interesting to see if this can be applied on top of even larger models.
ACKNOWLEDGMENTS
PW sincerely thanks Zhen Wang for the insightful discussion and for providing reference reposi-
tories for language model pre-training. PW also appreciates Hao Tan’s assistance for reproducing
fine-tuning results on GLUE datasets. YK and LTH were partially supported by an MIT-IBM Watson
AI grant and an Amazon award. We also acknowledge support from the IBM Research AI Hard-
ware Center, and the Center for Computational Innovation at Rensselaer Polytechnic Institute for the
computational resources on the AiMOS Supercomputer. The research of ZW is in part supported by
the US Army Research Office Young Investigator Award (W911NF2010240).
REFERENCES
Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale
image recognition without normalization. In International Conference on Machine Learning, pp.
1059–1071, 2021.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. In Proceedings of NeurIPS, 2020.
Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture search by
network transformation. In Proceedings of AAAI, 2018.
Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual net-
works from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao
Chen, Zhiyuan Liu, and Qun Liu. bert2bert: Towards reusable pretrained language models. arXiv
preprint arXiv:2110.07143, 2021.
Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary
differential equations. Advances in neural information processing systems, 31, 2018.
Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge
transfer. arXiv preprint arXiv:1511.05641, 2015.
Justin Chiu, Yuntian Deng, and Alexander Rush. Low-rank constraints for fast inference in struc-
tured models. Advances in Neural Information Processing Systems, 34:2887–2898, 2021.
Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. Nest: A neural network synthesis tool based on a
grow-and-prune paradigm. IEEE Transactions on Computers, 68(10):1487–1497, 2019.
Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, and Christopher Ré. Learning fast algorithms for
linear transforms using butterfly factorizations. In International conference on machine learning,
pp. 1517–1527, 2019.
Tri Dao, Nimit S Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri
Rudra, and Christopher Ré. Kaleidoscope: An efficient, learnable representation for all structured
linear maps. arXiv preprint arXiv:2012.14966, 2020.
Tri Dao, Beidi Chen, Nimit S Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu,
Aniruddh Rao, Atri Rudra, and Christopher Ré. Monarch: Expressive structured matrices for
efficient and accurate training. In International Conference on Machine Learning, pp. 4690–
4721, 2022.
Yann N Dauphin and Samuel Schoenholz. Metainit: Initializing learning by learning to initialize.
Advances in Neural Information Processing Systems, 32, 2019.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi-
erarchical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pp. 248–255, 2009.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
Chengyu Dong, Liyuan Liu, Zichao Li, and Jingbo Shang. Towards adaptive residual network
training: A neural-ode perspective. In International conference on machine learning, pp. 2616–
2626. PMLR, 2020.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszko-
reit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale. In Proceedings of ICLR, 2021.
Utku Evci, Max Vladymyrov, Thomas Unterthiner, Bart van Merriënboer, and Fabian Pe-
dregosa. Gradmax: Growing neural networks using gradient information. arXiv preprint
arXiv:2201.05125, 2022.
Scott Fahlman. The recurrent cascade-correlation architecture. In Advances in Neural Information
Processing Systems, 1990.
Scott Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In Advances in
Neural Information Processing Systems, 1989.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and
statistics, pp. 249–256, 2010.
Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. Efficient training of
bert by progressively stacking. In International conference on machine learning, pp. 2337–2346,
2019.
Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chen Chen, and Jiawei Han. On the transformer
growth for progressive bert training. arXiv preprint arXiv:2010.12562, 2020.
Steven Gutstein, Olac Fuentes, and Eric Freudenthal. Knowledge transfer in deep convolutional
neural nets. In Proceedings of International Journal on Artificial Intelligence Tools, 2008.
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks
with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, and Denny
Zhou. Token dropping for efficient bert pretraining. arXiv preprint arXiv:2203.13240, 2022.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, An-
drea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp.
In International Conference on Machine Learning, 2019.
Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. Improving transformer opti-
mization through better initialization. In International Conference on Machine Learning, pp.
4475–4483, 2020.
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong
Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural
networks using pipeline parallelism. Advances in neural information processing systems, 32,
2019.
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and
Ser-Nam Lim. Visual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language
models. arXiv preprint arXiv:2001.08361, 2020.
Yannic Kilcher, Gary Bécigneul, and Thomas Hofmann. Escaping flat areas via function-preserving
structural network modifications. 2018.
Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained
categorization. In Proceedings of the IEEE international conference on computer vision work-
shops, pp. 554–561, 2013.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. arXiv preprint arXiv:2104.08691, 2021.
Changlin Li, Bohan Zhuang, Guangrun Wang, Xiaodan Liang, Xiaojun Chang, and Yi Yang. Au-
tomated progressive learning for efficient training of vision transformers. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12486–12496, 2022.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019.
Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422,
2015.
Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number
of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing,
pp. 722–729. IEEE, 2008.
Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapter-
fusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247,
2020.
Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu,
Peng Li, Maosong Sun, et al. Knowledge inheritance for pre-trained language models. arXiv
preprint arXiv:2105.13880, 2021.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language under-
standing by generative pre-training. 2018.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text
transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions
for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions
for squad. arXiv preprint arXiv:1806.03822, 2018.
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo,
Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function
emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the
National Academy of Sciences, 118(15), 2021. ISSN 0027-8424. doi: 10.1073/pnas.2016239118.
Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction
of the generalization error across scales. arXiv preprint arXiv:1909.12673, 2019.
Kathrin Schacke. On the kronecker product. Master’s thesis, University of Waterloo, 2004.
Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew Peters, and Iz Beltagy. Staged training
for transformer language models. arXiv preprint arXiv:2203.06211, 2022.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using model par-
allelism. arXiv preprint arXiv:1909.08053, 2019.
Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep
learning. Advances in Neural Information Processing Systems, 28, 2015.
Hao Tan and Mohit Bansal. Vokenization: Improving language understanding with contextualized,
visual-grounded supervision. arXiv preprint arXiv:2010.06775, 2020.
Shanshan Tang, Bo Li, and Haijun Yu. Chebnet: Efficient and stable constructions of deep
neural networks with rectified power units using chebyshev approximations. arXiv preprint
arXiv:1911.05467, 2019.
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and
Hervé Jégou. Training data-efficient image transformers & distillation through attention. In
International Conference on Machine Learning, pp. 10347–10357, 2021a.
Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going
deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2021b.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In Proceedings of NeurIPS,
2017.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue:
A multi-task benchmark and analysis platform for natural language understanding. In Proceedings
of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for
NLP, pp. 353–355, 2018.
Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Sum-
mers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised
classification and localization of common thorax diseases. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 2097–2106, 2017.
Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. Network morphism. In Maria Florina
Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on
Machine Learning, pp. 564–572, 2016.
Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural architectures.
Advances in neural information processing systems, 32, 2019.
Lemeng Wu, Dilin Wang, Peter Stone, and Qiang Liu. Firefly neural architecture descent: a general
approach for growing neural networks. Advances in neural information processing systems, 2021.
Cheng Yang, Shengnan Wang, Chao Yang, Yuechuan Li, Ru He, and Jingqiao Zhang. Progressively
stacking 2.0: A multi-stage layerwise training method for bert training speedup. arXiv preprint
arXiv:2011.13635, 2020.
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan
Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep
learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without
normalization. arXiv preprint arXiv:1901.09321, 2019.
Minjia Zhang and Yuxiong He. Accelerating training of transformer-based language models with
progressive layer dropping. Advances in Neural Information Processing Systems, 33:14011–
14023, 2020.
Xu Zhang, Felix X Yu, Ruiqi Guo, Sanjiv Kumar, Shengjin Wang, and Shi-Fu Chang. Fast orthogo-
nal projection based on kronecker product. In Proceedings of the IEEE International Conference
on Computer Vision, pp. 2929–2937, 2015.
Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, W Ronny Huang, and Tom Goldstein. Gradinit:
Learning to initialize neural networks for stable and efficient training. Advances in Neural Infor-
mation Processing Systems, 34:16410–16422, 2021.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and
Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching
movies and reading books. In Proceedings of the IEEE international conference on computer
vision, pp. 19–27, 2015.
A UNIVERSALITY OF LIGO OPERATOR
Proposition 1. StackBERT (Eq. 1), Interpolation (Eq. 1), and Net2Net (Eq. 2) are all special
cases of the LiGO operator (Eq. 8).
Stacking. Stacking-based methods (Gong et al., 2019; Yang et al., 2020) duplicate the entire stack of
lower blocks on top of the small model to form new layers (Eq. 1). Formally, we show this operation
can be done by the following operator:

M = \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \\ I & & \\ & \ddots & \\ & & I \end{bmatrix}}_{L_{\mathrm{depth}}} \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \end{bmatrix}}_{R_{\mathrm{width}}}    (9)
Interpolation. Interpolation-based methods (Chang et al., 2017; Dong et al., 2020) repeat each
layer twice in an interleaved fashion. We can construct the following matrix to achieve layer interpolation (Eq. 1):

M = \underbrace{\begin{bmatrix} I & & & \\ I & & & \\ & I & & \\ & I & & \\ & & \ddots & \\ & & & I \\ & & & I \end{bmatrix}}_{L_{\mathrm{depth}}} \underbrace{\begin{bmatrix} I & & \\ & \ddots & \\ & & I \end{bmatrix}}_{R_{\mathrm{width}}}    (10)
We remark that any rearrangement of layers to construct new layers (mathematically a permutation
of existing layers with replacement) can be constructed in a similar way.
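In terms of the blending matrix w of Eq. 8 (with A_l, B_l set so that R_width is the identity), the two special cases above correspond to simple 0/1 patterns. A small NumPy sketch for L_1 = 3 and k = 2:

```python
import numpy as np

# Blending matrices w that recover the depth special cases above (R_width = I).
L1, k = 3, 2
I = np.eye(L1)

w_stack  = np.vstack([I, I])          # Eq. 9: new layers 0..5 copy old layers 0,1,2,0,1,2
w_interp = np.repeat(I, k, axis=0)    # Eq. 10: new layers copy old layers 0,0,1,1,2,2

print(w_stack.argmax(axis=1))     # [0 1 2 0 1 2]
print(w_interp.argmax(axis=1))    # [0 0 1 1 2 2]
```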
B IMPLEMENTATION DETAILS

B.1 GROWING TRANSFORMERS WITH LIGO
The transformer architecture consists of an embedding layer, a stack of attention blocks, and an
output layer. The core attention block consists of a Multi-Head Attention (MHA) module
followed by a FeedForward Network (FFN), with a skip connection across both blocks. Applying
LiGO requires the following considerations:
14
Published as a conference paper at ICLR 2023
Embedding layer. For both language and vision transformers, the embedding layer can be re-
garded as a linear layer, whose inputs are one-hot embeddings in language models. We introduce a
learnable matrix B^(emb) to extend its output dimension.
Feed-forward networks. Each attention block is followed by a two-layer FFN. Let A_l^k and B_l^k
with k ∈ {fc1, fc2} be the in- and out-dimension expansion matrices (Eq. 6) for the first and
second FFN layer in the l-th block, respectively. We tie the parameters for the feed-forward networks:
A_l^(fc1) = (B^(emb))^⊤, A_l^(fc2) = (B_l^(fc1))^⊤, and B_l^(fc2) = B^(emb).
Output layer. For the output head, we have A^(out) = (B^(emb))^⊤, since the output dimensions of the
attention layers are always aligned with B^(emb) by our construction. The output layer does not need
out-dimension expansion. Algorithm 1 summarizes LiGO for growing transformers.
We summarize the settings of different transformer models used for our experiments in Table 4. For
BERT and RoBERTa, we re-use the code base provided by Tan & Bansal (2020). For GPT2, we
follow the model configuration of OpenAI and use the pre-training code provided by Shen et al.
(2022). For DeiT, we use their official codebase (Touvron et al., 2021a).
For layer dropping, we follow the same progressive dropping-rate schedule as Zhang & He (2020),
and set the maximum dropping rate to 0.1 to recover the performance. For token dropping, we
randomly set aside 15% of tokens in the middle layers. In the first 50k steps of staged training, only a
sub-network is activated and trained, and afterwards, we perform full-model training for 350k steps.
C ADDITIONAL EXPERIMENTS
LiGO focuses on utilizing the knowledge of smaller models that have already been pretrained and
are available. In this section, we investigate how LiGO can leverage smaller existing models that are
only trained for a few steps to accelerate training of a larger model. We perform an experiment on
BERT-Base by reusing a BERT-Small trained for only 50k steps instead of the full 220k steps
used in our other experiments. Figure 7 shows that LiGO still provides 35.2% savings in FLOPs and
30.2% savings in wall time over training BERT-Base from scratch.
C.2 RESULTS ON CAIT
In addition to DeiT (Touvron et al., 2021a), we perform additional experiments with CaiT (Touvron
et al., 2021b) on ImageNet and find that, by reusing CaiT-XS, LiGO offers about 52.6% savings
in FLOPs and 46.1% savings in wall time over training CaiT-S from scratch (see Figure 8).
LiGO is mainly proposed for improving efficiency of the pre-training stage and hence is compatible
with various finetuning schemes like full model finetuning, adapters (Houlsby et al., 2019; Pfeiffer
et al., 2020) or prompt tuning (Lester et al., 2021; Jia et al., 2022) for adaptation to downstream
tasks. We test BERT-Base models trained using different baselines by using adapterfusion (Pfeiffer
et al., 2020) instead of full finetuning on GLUE benchmark. Table 6 shows that LiGO also achieves
(a) BERT-Small→BERT-Base (b) BERT-Small→BERT-Base
Figure 7: Results on BERT-Base by reusing BERT-Small trained for 50k steps. Instead of training BERT-Base
from a fully trained BERT-Small, we run LiGO on a BERT-Small trained for only 50k steps. LiGO offers about 35.2%
savings in FLOPs and 30.2% savings in wall time over training BERT-Base from scratch.
(a) CaiT-XS→CaiT-S (b) CaiT-XS→CaiT-S
Figure 8: Results on CaiT. (a) Accuracy vs. FLOPs and (b) accuracy vs. wall time for training CaiT-S. LiGO
saves FLOPs by 52.6% and wall time by 46.1% over training from scratch on ImageNet.
Table 5: GLUE performance of different LiGO models. All of the results are based on BERT-Base models
with BERT-Small as the base model for LiGO optimization.
Method  SST-2 (Acc.)  MNLI (Acc.)  MRPC (Acc.)  CoLA (Acc.)  QNLI (Acc.)  QQP (Acc.)  STS-B (Acc.)  Average (Acc.)
BERT-Small (Scratch) 87.21 77.56 82.11 59.93 85.06 85.82 84.99 80.38
BERT-Base (LiGO Init) 88.15 77.62 82.53 60.70 85.79 86.65 85.83 81.04
BERT-Base (LiGO Init + Pretrain) 88.42 79.29 84.31 62.09 88.07 88.81 87.00 82.57
BERT-Base (Scratch) 88.19 78.43 85.78 62.09 87.06 87.18 86.99 82.25
Table 6: Downstream performance using AdapterFusion (Pfeiffer et al., 2020) on GLUE Benchmark. All of
the results are based on BERT-Base models trained using different baselines.
Method  Savings (FLOPs)  Savings (Walltime)  SST-2 (Acc.)  MNLI (Acc.)  MRPC (Acc.)  CoLA (Acc.)  QNLI (Acc.)  QQP (Acc.)  STS-B (Acc.)  Average (Acc.)
Scratch – – 88.41 78.60 86.02 62.39 87.62 88.02 86.52 82.51
StackBERT 34.1% 33.3% 88.78 79.80 85.43 59.56 87.71 89.19 86.27 82.39
MSLT 34.9% 30.0% 88.41 78.35 83.15 63.97 86.19 88.20 86.42 82.10
KI -5.7% -13.9% 88.94 78.84 84.00 64.61 86.75 88.19 87.93 82.75
bert2BERT 29.0% 25.1% 88.47 80.53 85.50 62.33 88.57 86.72 87.10 82.75
LiGO 44.7% 40.5% 88.45 80.01 84.67 63.05 88.06 88.92 87.00 82.88
on-par performance with the model trained from scratch under adapter-based tuning, with 44.7% savings
in FLOPs and 40.5% savings in wall time. This shows that LiGO does not harm the model's general-
ization capability when adapters are used as a parameter-efficient finetuning strategy for transferring
a trained model to downstream datasets.
Our extensive experiments on BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT2 (Rad-
ford et al., 2019), DeiT (Touvron et al., 2021a) and CaiT (Touvron et al., 2021b) show that LiGO
can consistently improve transformer training efficiency over the traditional way of training from
scratch across domains and model sizes. One interesting future direction of our work is scaling
LiGO to very large models with more than 100B parameters, such as GPT3 (Brown et al., 2020).
While we currently do not possess the compute resources for this extreme large-scale study, we per-
form a preliminary experiment on GPT2-1.5B (Radford et al., 2019) by using GPT2-Medium as the
initialization. We train for 15k steps on C4 dataset (Raffel et al., 2020) and find that our proposed
LiGO saves about 39% computation cost (FLOPs) of training GPT2-1.5B from scratch to reach the
same log perplexity (which is 3.3). We believe that it is imperative to study the extent to which the
benefits of LiGO remain at the scale at which modern large language models are trained. We
hope to cover this in our future work.