EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation

∗ Corresponding author.

Abstract

Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, due to their bulky size and slow inference speed, it is hard to deploy them on edge devices. In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA), since the computational cost of FFN is 2∼3 times larger than that of MHA. Hence, to compact BERT, we are devoted to designing an efficient FFN, as opposed to previous works that pay attention to MHA. Since FFN comprises a multilayer perceptron (MLP) that is essential in BERT optimization, we further design a thorough search space towards an advanced MLP and perform a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate searching and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show that our searched EfficientBERT is 6.9× smaller and 4.4× faster than BERTBASE, and has competitive performance on the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains a 77.7 average score on the GLUE test set, 0.7 higher than MobileBERTTINY, and achieves an 85.3/74.5 F1 score on the SQuAD v1.1/v2.0 dev sets, 3.2/2.7 higher than TinyBERT4 even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.

Figure 1: Final score vs. latency tradeoff curve. Final score refers to the average score on the GLUE test set.

1 Introduction

Diverse pre-trained language models (PLMs) (e.g., BERT (Devlin et al., 2019)) have been intensively investigated by designing new pretext tasks, architectures, or attention mechanisms (Yang et al., 2019; Jiang et al., 2020; Beltagy et al., 2020). The performances of these PLMs far exceed traditional methods on a variety of natural language processing (NLP) tasks. Nevertheless, their shortcomings are still evident (including a considerable model size and low inference efficiency), limiting real-world application scenarios.

To alleviate the aforementioned limitations, many model compression methods have been proposed, including quantization, weight pruning, and knowledge distillation (KD) (Shen et al., 2020; Michel et al., 2019; Jiao et al., 2020). Among them, KD (Hinton et al., 2015), which transfers the knowledge from larger teacher models to smaller student models with minimal performance sacrifice, is most widely used due to its plug-and-play feasibility and its scalability in the rapid delivery of new models. Specifically, KD allows us to train our own BERT architecture significantly faster than training from scratch. Hence, we adopt KD in this paper. Besides, inspired by the impressive results of neural architecture search (NAS) in vision tasks (Howard et al., 2019; Li et al., 2020; Cai et al., 2020), adopting NAS to further boost the performance of PLMs or reduce their computational cost has attracted increasing attention (So et al., 2019; Wang et al., 2020a; Chen et al., 2020).
Although considerable progress has been made in the field of KD for PLMs, the compression of the feed-forward network (FFN) has been rarely studied. This contradicts the fact that the computational cost of FFN is 2∼3 times larger than that of the multi-head attention (MHA). In addition, Dong et al. (2021) have proved that the multilayer perceptron (MLP) in FFN can prevent the undesirable rank collapse caused by self-attention and thus can improve BERT optimization. These motivate us to investigate the nonlinearity of FFN in BERT.

In this paper, we make the first attempt to compress and improve the barely-explored multilayer perceptron (MLP) in FFN and propose a novel coarse-to-fine NAS approach with warm-up KD to find the optimal MLP architectures, aiming to search for a universal small BERT model with competitive performance and strong transferability. Specifically, we design a rich and flexible search space to discover an excellent FFN with maximal nonlinearity and minimal computational cost. Our search space contains various mathematical operations, stack numbers, and expansion ratios of the intermediate hidden size. To search this vast space efficiently, we progressively shrink the search space in three stages.

• Stage 1: Perform a coarse search to explore the entire search space (i.e., jointly searching the mathematical operations, stack numbers, and expansion ratios) (Figure 2 (a)).
• Stage 2: Fix the stack numbers and expansion ratios, and perform a fine-grained search for the optimal mathematical operations (Figure 2 (b)).
• Stage 3: Fix the mathematical operations, and perform a fine-grained search for the optimal stack numbers and expansion ratios (Figure 2 (c)).
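To make the three-stage procedure concrete, the sketch below shows one way the progressive shrinking could be driven. It is an illustration only: the random sampling routine, trial counts, and the proxy_score stub are placeholders, not the released implementation (which samples candidates with a learnable decision tree and scores them via warm-up KD, as detailed in Section 3.2).

```python
# Illustrative driver for the coarse-to-fine search (placeholders only).
import random
from dataclasses import dataclass, replace

UNARY = ["gelu", "relu", "sigmoid", "tanh", "swish", "elu", "leaky_relu"]
BINARY = ["add", "mul", "max"]

@dataclass(frozen=True)
class Candidate:
    ops: tuple      # mathematical operations of the FFN (DAG nodes)
    stacks: int     # FFN stack number, from {1, 2, 3, 4}
    ratio: float    # intermediate expansion ratio, from {1, 1/2, 1/3, 1/4}

def sample_candidate():
    ops = tuple(random.choice(UNARY + BINARY) for _ in range(3))
    return Candidate(ops, random.choice([1, 2, 3, 4]),
                     random.choice([1, 1/2, 1/3, 1/4]))

def proxy_score(cand):
    # Stand-in for: inherit weights from the warmed-up supernet, run a short
    # KD-based pre-training/fine-tuning, and return a proxy (e.g., MNLI) score.
    return random.random()

def search_stage(sample_fn, n_trials):
    return max((sample_fn() for _ in range(n_trials)), key=proxy_score)

# Stage 1: coarse search over all factors.
s1 = search_stage(sample_candidate, n_trials=50)
# Stage 2: fix stack number and ratio, refine the mathematical operations.
s2 = search_stage(lambda: replace(sample_candidate(), stacks=s1.stacks, ratio=s1.ratio), 50)
# Stage 3: fix the operations, refine stack number and expansion ratio.
s3 = search_stage(lambda: replace(sample_candidate(), ops=s2.ops), 50)
print(s3)
```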
Even with this elegant coarse-to-fine search strategy, pre-training each candidate model still needs a lot of time to converge during searching. To solve this problem, different from the conventional KD strategy (Jiao et al., 2020), we propose a warm-up KD strategy to transfer the knowledge quickly, where a pre-trained supernet is additionally introduced to perform a joint warm-up for all candidate models. Note that the warm-up strategy in the third stage is slightly different from that of the first two stages. During the first two stages, each candidate model initially inherits its weights from a frozen warmed-up supernet to accelerate searching. But in the third stage, since there is no need to search the mathematical operations, an unfrozen warmed-up supernet sharing weights across different candidate models is allowed, i.e., each model can inherit weights from this activated warmed-up supernet for a quick launch and is then trained with weight sharing in a multi-task manner to enhance transferability.

Extensive experimental results show that our searched architecture, named EfficientBERT, is 6.9× smaller and 4.4× faster than BERTBASE, and has competitive performance. On the test set of the GLUE benchmark, EfficientBERT attains an average score of 77.7, which is 0.7 higher than MobileBERTTINY, and achieves an F1 score of 85.3/74.5 on the SQuAD v1.1/v2.0 dev datasets, which is 3.2/2.7 higher than TinyBERT4 even without data augmentation.

2 Related Work

Compression for Pre-trained Language Models. For the past few years, pre-trained language models (PLMs) have demonstrated their strong power on a variety of NLP tasks, with a trend of ever larger model sizes as well as better results. However, it is hard to deploy them on resource-limited edge devices for practical usage. To solve this problem, many efficient PLMs have been proposed (Turc et al., 2019; Lan et al., 2020). For example, Turc et al. (2019) directly pre-train and fine-tune smaller BERT models. In addition, many compression techniques for PLMs have been proposed recently to reduce the training cost, including quantization, weight pruning, and knowledge distillation (KD) (Shen et al., 2020; Sajjad et al., 2020; Jiao et al., 2020). Among them, KD (Hinton et al., 2015) is widely used due to its plug-and-play feasibility; it aims to transfer the knowledge from larger teacher models to smaller student models without sacrificing too much performance. For example, BERT-PKD (Sun et al., 2019) jointly distills the intermediate and last layers during fine-tuning. DistilBERT (Sanh et al., 2020) distills the last layers with a triple loss during pre-training. MobileBERT (Sun et al., 2020) designs an inverted-bottleneck model structure and progressively transfers the knowledge during pre-training. MiniLM (Wang et al., 2020b) performs a deep self-attention distillation during pre-training. TinyBERT (Jiao et al., 2020) introduces a comprehensive Transformer distillation method during pre-training and fine-tuning.

Nevertheless, the compression of the feed-forward network (FFN) has not been well studied, although its computational cost is 2∼3 times larger than that of the multi-head attention (MHA), as pointed out by Iandola et al. (2020). In contrast, compressing FFN is our main focus in this work.
Figure 2: An overview of the search procedure of our EfficientBERT. The teacher model is BERTBASE (left), and
the search space of our student model is designed towards achieving better nonlinearity of FFN, which contains
mathematical operations, stack numbers, and intermediate expansion ratios (right). During searching, we progres-
sively shrink the search space and divide the search process into three stages to conduct NAS ((a)-(c)). Both in
the search and retraining stages, we use a novel warm-up knowledge distillation to transfer the teacher model’s
knowledge (middle). Specifically, each candidate or retrained subnet first inherits the weights from a frozen or
activated warmed-up supernet, and then conducts pre-training and fine-tuning on a single-task or multi-task dataset.
Table 1: Mathematical operations of FFN. Following the original papers of GeLU (Hendrycks and Gimpel, 2016) and Leaky ReLU (He et al., 2015), we let c1 = 0.5, c2 = √(2/π), c3 = 0.044715, c4 = 0.01.

Operation   | Expression                               | Arity
Add         | x + y                                    | 2
Mul         | x × y                                    | 2
Max         | max(x, y)                                | 2
GeLU        | c1·x·(1 + tanh(c2·(x + c3·x^3)))         | 1
Sigmoid     | 1/(1 + e^(−x))                           | 1
Tanh        | (e^x − e^(−x))/(e^x + e^(−x))            | 1
ReLU        | max(x, 0)                                | 1
Leaky ReLU  | x if x ≥ 0 else c4·x                     | 1
ELU         | x if x ≥ 0 else e^x − 1                  | 1
Swish       | x/(1 + e^(−x))                           | 1

Neural Architecture Search. Motivated by the success of neural architecture search (NAS) in computer vision (Howard et al., 2019; Li et al., 2020; Cai et al., 2020), increasing attention has been paid to applying NAS to NLP tasks (So et al., 2019; Wang et al., 2020a; Chen et al., 2020), aiming to automatically search for optimal architectures from a vast search space. Evolved Transformer (So et al., 2019) employs NAS to search for a better Transformer architecture with an evolutionary algorithm. HAT (Wang et al., 2020a) applies NAS to search for efficient hardware-aware Transformer models based on a Transformer supernet. AdaBERT (Chen et al., 2020) searches for task-adaptive small models with KD and a differentiable NAS method. NAS-BERT (Xu et al., 2021) proposes a task-agnostic NAS method for adaptive-size model compression, where several acceleration techniques (including block-wise search, search space pruning, and performance approximation) are introduced to speed up the searching process.

Differently, in this paper, we design a comprehensive search space towards the nonlinearity of the multilayer perceptron (MLP) in FFN, and propose a novel coarse-to-fine NAS approach with warm-up KD to find the optimal MLP architectures. Unlike AdaBERT, we apply NAS to search for a small universal BERT with competitive performance and strong transferability. And unlike NAS-BERT, we design a much more flexible search space and use warm-up KD with a coarse-to-fine searching paradigm to accelerate searching and enhance model transferability.

3 Our EfficientBERT

We aim at discovering a lightweight MLP architecture with better nonlinearity in each FFN layer, ensuring that the searched model can achieve compelling performance. We first present the search space towards better nonlinearity of the MLP in FFN, as described in Section 3.1. Then we propose a novel coarse-to-fine NAS method with warm-up KD, as discussed in Section 3.2.
3.1 Search Space Design

In a standard Transformer layer, there are two main components: a multi-head attention (MHA) and a feed-forward network (FFN). Theoretically, the computation (Mult-Adds) of MHA and FFN is O(4Ld² + L²d) and O(2 × 4Ld²), respectively, where L is the sequence length and d is the channel number. As d gets larger, the computation of FFN grows larger than that of MHA. And as pointed out by previous works (Iandola et al., 2020), the latency of MHA and FFN in each layer of BERTBASE accounts for about 30% and 70%, respectively, on a Google Pixel 3 smartphone, and the parameter numbers of MHA and FFN are about 2.4M and 4.7M, respectively. These observations demonstrate the potential of compressing FFN, i.e., compressing FFN may be more promising than squeezing MHA. In addition, as discussed by Dong et al. (2021), the MLP in FFN can prevent an optimization problem, i.e., rank collapse, caused by self-attention; thus, the nonlinearity ability of FFN deserves to be investigated. Hence, our main focus is the compression and improvement of FFN.
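To make the O(4Ld² + L²d) vs. O(2 × 4Ld²) comparison concrete, the short calculation below plugs in BERTBASE-like sizes (L = 128 tokens, d = 768 channels, our assumed values); it only counts the dominant matrix multiplications and is not a profiler.

```python
# Rough Mult-Adds estimate for one Transformer layer (assumed sizes).
L, d = 128, 768

mha_mult_adds = 4 * L * d**2 + L**2 * d   # Q/K/V/output projections + attention maps
ffn_mult_adds = 2 * 4 * L * d**2          # two linear layers with expansion ratio 4

print(f"MHA: {mha_mult_adds / 1e9:.2f} G Mult-Adds")      # ~0.31 G
print(f"FFN: {ffn_mult_adds / 1e9:.2f} G Mult-Adds")      # ~0.60 G
print(f"FFN / MHA: {ffn_mult_adds / mha_mult_adds:.2f}")  # ~1.9, and it grows with d
```

The measured gap on hardware (about 30% vs. 70% of the per-layer latency) is even larger than this pure Mult-Adds ratio.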
We then design a search space towards the nonlinearity of the MLP in FFN to search for a model with better FFN nonlinearity and increased performance. Many factors determine the FFN nonlinearity, such as the mathematical operations and the expansion ratios of the intermediate hidden size. Inspired by MobileBERT (Sun et al., 2020), we find that by increasing the stack number of FFN, the model performance can also be remarkably improved. We integrate all of the above factors into our search space, including the mathematical operations, stack numbers, and intermediate expansion ratios of FFN. (1) Mathematical operation: we define some primitive operations (including several binary aggregation functions and unary activation functions) and search over their different combinations, as shown in Table 1. (2) Stack number: the stack number of FFN is selected from {1, 2, 3, 4}. (3) Intermediate expansion ratio: the intermediate expansion ratio is selected from {1, 1/2, 1/3, 1/4}. Note that the stack number and the intermediate expansion ratio are jointly considered to balance the computational cost, e.g., the number of network parameters. We use a directed acyclic graph (DAG) to represent each FFN architecture when searching the mathematical operations. The mathematical operations and linear operations are optionally placed in the intermediate nodes to process the hidden states. More details of our search space can be found in Figure 2.
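A minimal PyTorch sketch of such a searchable FFN block is given below. This is our simplification, not the released code: the full search space composes unary and binary operations from Table 1 in a DAG, whereas the sketch keeps only one searched unary operation per stacked sub-FFN, and the hidden size and defaults are illustrative.

```python
# Minimal sketch of a searchable FFN block (illustrative sizes and structure).
import torch
import torch.nn as nn
import torch.nn.functional as F

UNARY_OPS = {
    "gelu": F.gelu,
    "relu": F.relu,
    "sigmoid": torch.sigmoid,
    "tanh": torch.tanh,
    "swish": F.silu,                              # x * sigmoid(x)
    "elu": F.elu,
    "leaky_relu": lambda x: F.leaky_relu(x, 0.01),
}

class SearchableFFN(nn.Module):
    """`stack_num` stacked sub-FFNs; each has intermediate width
    hidden_dim * ratio and a searched activation."""
    def __init__(self, hidden_dim=512, stack_num=2, ratio=0.25, ops=("gelu", "relu")):
        super().__init__()
        inner = max(1, int(hidden_dim * ratio))
        self.stages = nn.ModuleList(
            nn.ModuleDict({
                "up": nn.Linear(hidden_dim, inner),
                "down": nn.Linear(inner, hidden_dim),
                "norm": nn.LayerNorm(hidden_dim),
            }) for _ in range(stack_num)
        )
        self.acts = [UNARY_OPS[o] for o in ops[:stack_num]]

    def forward(self, x):
        for stage, act in zip(self.stages, self.acts):
            h = stage["down"](act(stage["up"](x)))
            x = stage["norm"](x + h)              # residual + LayerNorm per sub-FFN
        return x

ffn = SearchableFFN(stack_num=3, ratio=1/3, ops=("gelu", "relu", "swish"))
print(ffn(torch.randn(2, 128, 512)).shape)        # torch.Size([2, 128, 512])
```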
3.2 Neural Architecture Search

Base Model Design. As discussed by previous BERT compression works (Jiao et al., 2020; Sun et al., 2020), there are several strategies to reduce the model size, including embedding factorization and model width/depth reduction. However, most of the recent works only consider part of these strategies. In our work, we design a base model with all these strategies to make a comprehensive compression. Besides, we find that the expansion ratio of the intermediate hidden size in FFN contributes a lot to the model size and inference latency. Thus the reduction of the intermediate expansion ratio is also considered. The detailed settings of our base model can be found in Section 4.2.

Coarse-to-Fine NAS with Warm-up KD. To speed up the search in the vast search space, we propose a coarse-to-fine NAS method that progressively shrinks the search space. The search process is divided into three stages, where a coarse-grained search is conducted in the first stage to jointly search all of the factors in our search space, and fine-grained searches are conducted in the last two stages to search for partial factors.

In the first search stage, we jointly search all of the factors in our search space, including the mathematical operations, stack numbers, and intermediate expansion ratios. We use the DAG computation graph described in §3.1 to represent each MLP architecture. The initial search candidates are based on our base model, but different stack numbers and intermediate expansion ratios of FFN are allowed. During searching, each candidate model is first sampled by a learnable sampling decision tree, as proposed in LaNAS (Wang et al., 2019). Then warm-up KD is employed on each candidate model to accelerate the search process. Since we need to search for the mathematical operations, we cannot share the weights of different candidate models, to avoid potential interference problems. Instead, we first build a warmed-up supernet based on our base model with the maximum FFN stack number and intermediate expansion ratio in our search space. The supernet is pre-trained entirely (i.e., as the complete graph) with KD, and its weights are then frozen. When training each candidate model, we first inherit its weights from the supernet. Precisely, the weights of each stacked FFN are sliced from the bottom to the top layer, and the weights of each linear operation are sliced from the left to the right channel. After that, we only need to pre-train and fine-tune each model for a few steps via KD to adjust the inherited weights. This significantly reduces the search cost.
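The weight inheritance described above could look roughly like the following sketch, written under the assumption that each sub-FFN exposes its "up" and "down" linear layers; tensor layouts follow torch.nn.Linear, whose weight has shape [out_features, in_features].

```python
# Sketch of slicing candidate weights out of the frozen warmed-up supernet.
import torch
import torch.nn as nn

def slice_linear(super_linear: nn.Linear, sub_linear: nn.Linear):
    """Copy the leading rows/columns ("left to right" channels) of the
    supernet weights into the smaller candidate layer."""
    out_f, in_f = sub_linear.weight.shape
    with torch.no_grad():
        sub_linear.weight.copy_(super_linear.weight[:out_f, :in_f])
        sub_linear.bias.copy_(super_linear.bias[:out_f])

def inherit_ffn_stack(super_stack, cand_stack):
    """Stacked FFNs are sliced bottom-to-top: the candidate's i-th sub-FFN
    inherits from the supernet's i-th sub-FFN."""
    for super_ffn, cand_ffn in zip(super_stack, cand_stack):
        slice_linear(super_ffn["up"], cand_ffn["up"])
        slice_linear(super_ffn["down"], cand_ffn["down"])

# Supernet: maximum stack number (4) and maximum expansion ratio (1).
d = 512
super_stack = [nn.ModuleDict({"up": nn.Linear(d, d), "down": nn.Linear(d, d)})
               for _ in range(4)]
# Candidate: 2 stacked sub-FFNs with expansion ratio 1/4.
cand_stack = [nn.ModuleDict({"up": nn.Linear(d, d // 4), "down": nn.Linear(d // 4, d)})
              for _ in range(2)]
inherit_ffn_stack(super_stack, cand_stack)
```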
In the second search stage, to discover more diversified mathematical operations and evaluate their effects, we search them individually with the same method as in the first search stage. The initial search candidates are built upon the searched model of the first search stage (i.e., we fix the stack numbers and expansion ratios). The sampling and KD strategies are the same as in the first search stage.
In the third search stage, we jointly search the stack numbers and intermediate expansion ratios in the search space to further explore their potential. The initial search candidates are based on the second search stage's searched model; the searched mathematical operations are fixed, but different stack numbers and intermediate expansion ratios of FFN are allowed. We also apply warm-up KD to accelerate the searching. Specifically, we first warm up the supernet entirely (i.e., the complete graph) via KD again, but do not freeze its weights. Then we share the weights of different candidate models (i.e., subgraphs of the supernet) during pre-training and fine-tuning to accelerate the search. Each candidate model is sampled uniformly. Compared with the first two search stages, the search cost is dramatically reduced, enabling us to leverage more downstream datasets to enhance model transferability. Inspired by MT-DNN (Liu et al., 2019), each candidate model is fine-tuned in a multi-task manner on different categories of downstream tasks. The weights of the embedding and Transformer layers are shared across all tasks, while those of the prediction layers are task-specific.
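A minimal sketch of this multi-task setup (in the spirit of MT-DNN) is shown below; the toy encoder, task names, and label counts are placeholders, and only the structure (shared trunk, task-specific prediction heads) mirrors the description above.

```python
# Sketch of a shared-encoder, per-task-head student for stage-3 fine-tuning.
import torch
import torch.nn as nn

class MultiTaskStudent(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, task_num_labels: dict):
        super().__init__()
        self.encoder = encoder                              # shared across all tasks
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, n) for task, n in task_num_labels.items()
        })                                                  # task-specific classifiers

    def forward(self, task: str, input_ids: torch.Tensor):
        hidden = self.encoder(input_ids)                    # (batch, seq, hidden)
        pooled = hidden[:, 0]                               # [CLS]-style pooling
        return self.heads[task](pooled)

# Toy encoder just to make the sketch executable.
toy_encoder = nn.Sequential(nn.Embedding(30522, 128), nn.Linear(128, 128))
model = MultiTaskStudent(toy_encoder, hidden_dim=128,
                         task_num_labels={"mnli": 3, "rte": 2, "sst2": 2})
logits = model("mnli", torch.randint(0, 30522, (4, 32)))
print(logits.shape)  # torch.Size([4, 3])
```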
Warm-up KD Formulations. In our warm-up KD, each candidate/retrained model initially inherits the weights from a warmed-up supernet. We use BERTBASE (Devlin et al., 2019) as the teacher model. Following TinyBERT (Jiao et al., 2020), we jointly distill the attention matrices, Transformer-layer outputs, embeddings, and predicted logits between the student and teacher models. In detail, the attention loss at the m-th student layer, L^m_attn, is calculated by the mean squared error (MSE) loss as:

    L^m_attn = (1/h) Σ_{i=1}^{h} MSE(A^S_{i,m}, A^T_{i,n}),    (1)

where A^S_{i,m} and A^T_{i,n} refer to the i-th head of the attention matrices at the m-th student layer and its matching n-th teacher layer, respectively, and h is the number of attention heads. The Transformer-layer output loss at the m-th student layer, L^m_hidn, and the embedding loss, L_embd, can be formulated as:

    L^m_hidn = MSE(H^m_S W_h, H^n_T),    L_embd = MSE(E_S W_e, E_T),    (2)

where H^m_S and H^n_T are the Transformer-layer outputs at the m-th student layer and its matching n-th teacher layer, respectively, E is the embedding, and two learnable transformation matrices W_h and W_e are applied to align the mismatched dimensions between the student and teacher models. Moreover, the prediction loss L_pred, calculated by the soft cross-entropy (CE) loss, can be formulated as:

    L_pred = CE(z^S / t, z^T / t),    (3)

where z is the predicted logits vector and t is the temperature value. Finally, we combine all of the above losses and derive the overall KD loss as:

    L = Σ_{m=1}^{M} (L^m_attn + L^m_hidn) + L_embd + γ L_pred,    (4)

where M is the number of Transformer layers in the student model, and γ controls the weight of the prediction loss L_pred.
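The four loss terms above translate almost directly into code. The sketch below is a minimal PyTorch transcription (ours, not the released implementation), assuming the student and teacher forward passes return dictionaries with "embed", "attn", "hidden", and "logits" entries, and that layer_map lists the matched (student m, teacher n) pairs.

```python
# Sketch of the warm-up KD loss of Eqs. (1)-(4); names and shapes are assumed.
import torch
import torch.nn.functional as F

def attn_loss(attn_s, attn_t):
    # attn_s, attn_t: (heads, seq, seq) attention matrices of matched layers.  Eq. (1)
    return F.mse_loss(attn_s, attn_t)

def hidn_loss(h_s, h_t, W_h):
    # h_s: (seq, d_s), h_t: (seq, d_t), W_h: (d_s, d_t).  Eq. (2), first term
    return F.mse_loss(h_s @ W_h, h_t)

def embd_loss(e_s, e_t, W_e):
    # Eq. (2), second term
    return F.mse_loss(e_s @ W_e, e_t)

def pred_loss(z_s, z_t, t=1.0):
    # Soft cross-entropy between temperature-scaled logits.  Eq. (3)
    return -(F.softmax(z_t / t, dim=-1) * F.log_softmax(z_s / t, dim=-1)).sum(-1).mean()

def warmup_kd_loss(student_out, teacher_out, layer_map, W_h, W_e, gamma=1.0, t=1.0):
    # Eq. (4): sum attention/hidden losses over matched layers, then add the
    # embedding loss and the weighted prediction loss.
    loss = embd_loss(student_out["embed"], teacher_out["embed"], W_e)
    for m, n in layer_map:                      # student layer m -> teacher layer n
        loss = loss + attn_loss(student_out["attn"][m], teacher_out["attn"][n])
        loss = loss + hidn_loss(student_out["hidden"][m], teacher_out["hidden"][n], W_h)
    return loss + gamma * pred_loss(student_out["logits"], teacher_out["logits"], t)
```

As described in Section 4.3, γ is set to 0 during pre-training and 1 during fine-tuning, and the temperature t is set to 1.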
4 Experiment

This section demonstrates the superior performance and transferability of our EfficientBERT on a wide range of downstream tasks.

4.1 Datasets

We evaluate our model on two standard benchmarks for natural language understanding, i.e., the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) and the Stanford Question Answering Dataset (SQuAD). The GLUE benchmark contains nine classification datasets, including MNLI (Williams et al., 2018), QQP (Chen et al., 2018), QNLI (Rajpurkar et al., 2016), SST-2 (Socher et al., 2013), CoLA (Warstadt et al., 2019), STS-B (Cer et al., 2017), MRPC (Dolan and Brockett, 2005), RTE (Bentivogli et al., 2009), and WNLI (Levesque et al., 2011). The SQuAD task aims to predict the answer text span of a given question in a Wikipedia passage, and contains two datasets: SQuAD v1.1 (Rajpurkar et al., 2016) and SQuAD v2.0 (Rajpurkar et al., 2018). The metrics can be found in Wang et al. (2018) and Rajpurkar et al. (2016).
Table 2: Results on the test set of the GLUE benchmark. The architectures of the different models are as follows. BERTTINY & TinyBERT4: (M=4, d=312, di=1200); BERTSMALL: (M=4, d=512, di=2048); BERT-PKD4 & DistilBERT4: (M=4, d=768, di=3072); BERT-PKD6 & DistilBERT6: (M=6, d=768, di=3072). The latency is the average inference time over 100 runs on a single GPU with a batch size of 128.
Model #Params Latency MNLI-m/mm QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg
BERTBASE (Google) 108.9M 362ms 84.6/83.4 71.2 90.5 93.5 52.1 85.8 88.9 66.4 79.6
BERTBASE (Teacher) 108.9M 362ms 84.8/83.8 71.6 91.3 93.1 53.9 85.3 89.2 68.9 80.2
BERTTINY (Turc et al., 2019) 14.5M 43ms 75.4/74.9 66.5 84.8 87.6 19.5 77.1 83.2 62.6 70.2
BERTSMALL (Turc et al., 2019) 28.8M 75ms 77.6/77.0 68.1 86.4 89.7 27.8 77.0 83.4 61.8 72.1
BERT-PKD4 (Sun et al., 2019) 52.8M 129ms 79.9/79.3 70.2 85.1 89.4 24.8 79.8 82.6 62.3 72.6
BERT-PKD6 (Sun et al., 2019) 67.0M 193ms 81.5/81.0 70.7 89.0 92.0 43.5 81.6 85.0 65.5 76.6
DistilBERT4 (Sanh et al., 2020) 52.8M 129ms 78.9/78.0 68.5 85.2 91.4 32.8 76.1 82.4 54.1 71.9
DistilBERT6 (Sanh et al., 2020) 67.0M 193ms 82.6/81.3 70.1 88.9 92.5 49.0 81.3 86.9 58.4 76.8
TinyBERT4 (Jiao et al., 2020) 14.5M 43ms 81.8/80.7 69.6 87.7 91.2 27.2 83.0 88.5 64.9 75.0
MobileBERTTINY (Sun et al., 2020) 15.1M 96ms 81.5/81.6 68.9 89.5 91.7 46.7 80.1 87.9 65.1 77.0
EfficientBERTTINY 9.4M 52ms 82.4/81.0 70.3 88.5 91.2 37.5 80.9 87.8 64.6 76.0
EfficientBERT w/o Warm-up KD 15.7M 83ms 83.1/82.0 71.0 89.5 90.8 42.1 82.1 88.4 67.2 77.4
EfficientBERT 15.7M 83ms 83.3/82.3 71.0 90.2 92.1 43.8 82.9 88.2 65.7 77.7
EfficientBERT+ 15.7M 83ms 83.0/82.3 71.2 89.3 92.4 38.1 85.1 89.9 69.4 77.9
EfficientBERT++ 16.0M 103ms 83.0/82.5 71.2 90.6 92.3 42.5 83.6 88.9 67.8 78.0
Table 3: Results on the dev set of GLUE benchmark compared with other NAS methods. † indicates the results
with data augmentation.
Model #Params MNLI-m QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg
AdaBERT (Chen et al., 2020) † 6.4∼9.5M 81.3 70.5 87.2 91.9 - - 84.7 64.1 -
NAS-BERT10 (Xu et al., 2021) 10M 76.4 88.5 86.3 88.6 34.0 84.8 79.1 66.6 75.5
NAS-BERT30 (Xu et al., 2021) 30M 81.0 90.2 88.4 90.5 48.7 87.6 84.6 71.8 80.3
EfficientBERTTINY 9.4M 81.7 86.7 89.3 90.1 39.1 79.9 90.1 63.2 77.5
EfficientBERT 15.7M 83.1 87.3 90.4 91.3 50.2 82.5 91.5 66.8 80.4
Table 4: Results on the SQuAD dev datasets. The architectures of MiniLM4 and MiniLM6 are (M=4, d=384, di=1536) and (M=6, d=384, di=1536), respectively. † indicates the results with data augmentation.

Model #Params SQuAD v1.1 (EM/F1) SQuAD v2.0 (EM/F1)
BERTBASE (Google) 108.9M 80.8/88.5 -/-
BERTBASE (Teacher) 108.9M 80.5/88.2 74.8/77.7
BERT-PKD4 (Sun et al., 2019) 52.8M 70.1/79.5 60.8/64.6
BERT-PKD6 (Sun et al., 2019) 67.0M 77.1/85.3 66.3/69.8
DistilBERT4 (Sanh et al., 2020) 52.8M 71.8/81.2 60.6/64.1
DistilBERT6 (Sanh et al., 2020) 67.0M 78.1/86.2 66.0/69.5
TinyBERT4 (Jiao et al., 2020) † 14.5M 72.7/82.1 68.2/71.8
MiniLM4 (Wang et al., 2020b) 19.3M -/- -/69.7
MiniLM6 (Wang et al., 2020b) 22.9M -/- -/72.7
EfficientBERTTINY 9.4M 74.8/83.6 68.6/71.9
EfficientBERT 15.7M 77.0/85.3 71.4/74.5
EfficientBERT++ 16.0M 78.3/86.5 73.0/76.1

4.2 Model Settings

The embedding factorization strategy of our base model is the same as MobileBERT (Sun et al., 2020).
Model (Pre-train Dataset) #Params MNLI-m/mm QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg
Base Model (Wiki) 15.3M 82.5/81.6 71.0 89.0 91.4 37.3 82.1 86.1 65.8 76.3
Search Stage 1 (Wiki) 15.4M 82.8/82.0 71.0 89.7 91.8 37.4 82.2 87.7 65.3 76.7
Search Stage 2 (Wiki) 15.4M 82.8/82.3 70.9 89.8 92.2 38.3 82.1 88.5 65.7 77.0
Search Stage 2 (Wiki+Books) 15.4M 82.8/82.0 71.1 89.7 92.1 42.5 82.2 88.2 66.3 77.4
EfficientBERT (Wiki+Books) 15.7M 83.3/82.3 71.0 90.2 92.1 43.8 82.9 88.2 65.7 77.7
Table 6: Effectiveness comparison between single-stage searching and our coarse-to-fine NAS method.

Method Best Score #Searched Arch Search Cost
Single stage 76.7 2,700 84 GPU days
Search stage 1 76.7 1,900 54 GPU days
Search stage 1, 2 77.0 2,000 56 GPU days
Coarse-to-Fine NAS 77.7 5,000 58 GPU days
4.3 Implementation Details
In the first two search stages, the frozen supernet sliced by each candidate model is pre-trained for ten epochs, and we use 2% of the English Wikipedia corpus to pre-train each candidate model for one epoch. During fine-tuning, we use the first 10% of the MNLI training set to train each model for three epochs and the last 1% of the training set for evaluation. In the third search stage, the activated supernet is pre-trained and fine-tuned for ten epochs, and each candidate model is optimized for one step. We use the entire corpora of English Wikipedia and BooksCorpus as the pre-training data, the combination of 90% of the training set of each downstream GLUE task as the fine-tuning data, and the remaining 10% of the MNLI training set as the evaluation data. The batch size at each search stage is set to 256. The learning rates for pre-training and fine-tuning at each stage are set to 1e-4 and 4e-4, respectively.

During retraining, each searched model is first pre-trained for ten epochs based on the inherited weights from the warmed-up supernet, and is then fine-tuned on downstream tasks for ten epochs, except for CoLA. Note that CoLA is fine-tuned for 50 epochs following the widely-used protocol. The batch sizes for pre-training and fine-tuning are set to 256 and 32, respectively. The learning rate for pre-training is set to 1e-4. The learning rates for fine-tuning on the GLUE and SQuAD datasets are set to 5e-5 and 1e-4, respectively.

In all of our experiments, γ is set to 0 and 1 for pre-training and fine-tuning, respectively. t is set to 1. The maximum sequence length is set to 128. We use Adam with β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, a warm-up proportion of 0.1, and linear decay of the learning rate.

4.4 Results on GLUE

We compare our searched models with BERTTINY, BERTSMALL (Turc et al., 2019) and several state-of-the-art compressed BERT models, including BERT-PKD (Sun et al., 2019), DistilBERT (Sanh et al., 2020), TinyBERT4 (Jiao et al., 2020), and MobileBERTTINY (Sun et al., 2020). For a fair comparison, TinyBERT4 is re-implemented by removing the data augmentation and fine-tuning from the official general distillation weights². The experimental results on the test set of the GLUE benchmark are listed in Table 2 and Figure 1.

² We use the 2nd version from https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

From the results in Table 2, we can observe that: (1) Our EfficientBERT is 6.9× smaller and 4.4× faster than BERTBASE and achieves a competitive average GLUE score of 77.7, which is 0.7 higher than its counterpart MobileBERTTINY. (2) Our EfficientBERT+ has better transferability than EfficientBERT across different GLUE tasks, with an improvement of 0.2 on the average score, demonstrating the effectiveness of our multi-task training strategy in the third search stage. (3) Our EfficientBERT++ achieves state-of-the-art performance, outperforming MobileBERTTINY by 1.0 on the average score. (4) Our EfficientBERTTINY outperforms TinyBERT4 by a 1.0 average score with fewer parameters and similar latency. (5) Without our warm-up KD during retraining, i.e., pre-training the model from scratch rather than from the warmed-up supernet, the average score of EfficientBERT decreases by 0.3, demonstrating the advantage of retraining with our warm-up KD. And from the results in Figure 1, we can see that all of our searched models outperform the other compared models with similar or lower latency.

Furthermore, to verify the effectiveness of our proposed NAS method, we compare with several related NAS methods on the GLUE dev set, including
Table 7: Results of our EfficientBERT with different base models on the GLUE test set.
Model (Base Model) #Params MNLI-m/mm QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg
TinyBERT6 67.0M 83.8/83.2 71.4 89.8 92.0 38.8 83.1 89.0 65.8 77.4
EfficientBERT (TinyBERT6 ) 70.1M 84.1/83.2 71.4 90.4 92.6 46.2 83.7 89.0 67.7 78.7
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8815–8821.

David R. So, Quoc V. Le, and Chen Liang. 2019. The evolved Transformer. In Proceedings of the 36th International Conference on Machine Learning, pages 5877–5886.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4323–4332.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020b. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained Transformers. arXiv preprint arXiv:2002.10957.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, pages 625–641.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1112–1122.

Jin Xu, Xu Tan, Renqian Luo, Kaitao Song, Jian Li, Tao Qin, and Tie-Yan Liu. 2021. NAS-BERT: Task-agnostic and adaptive-size BERT compression with neural architecture search. arXiv preprint arXiv:2105.14444.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.
Figure 6: Architectures of (a) our base model, (b)-(c) the searched models of the first two search stages, and (d)
our EfficientBERT.
Figure 7: Visualization for the attention distributions of (a) BERTBASE , (b) our EfficientBERT, (c) TinyBERT6 ,
and (d) our EfficientBERT (TinyBERT6 ) in the last Transformer layer.