
EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation

Chenhe Dong1, Guangrun Wang2, Hang Xu3, Jiefeng Peng4, Xiaozhe Ren3, Xiaodan Liang1*
1 Shenzhen Campus of Sun Yat-sen University  2 University of Oxford
3 Huawei Noah's Ark Lab  4 DarkMatter AI Research
[email protected], {xu.hang,renxiaozhe}@huawei.com
{wanggrun,jiefengpeng,xdliang328}@gmail.com

* Corresponding author.

Abstract

Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, due to their bulky size and slow inference speed, it is hard to deploy them on edge devices. In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA), since the computational cost of FFN is 2~3 times larger than that of MHA. Hence, to compact BERT, we are devoted to designing an efficient FFN, as opposed to previous works that pay attention to MHA. Since FFN comprises a multilayer perceptron (MLP) that is essential in BERT optimization, we further design a thorough search space towards an advanced MLP and perform a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate searching and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show that our searched EfficientBERT is 6.9× smaller and 4.4× faster than BERTBASE, and has competitive performance on the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains a 77.7 average score on the GLUE test set, 0.7 higher than MobileBERTTINY, and achieves an 85.3/74.5 F1 score on the SQuAD v1.1/v2.0 dev sets, 3.2/2.7 higher than TinyBERT4, even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.

Figure 1: Final score vs. latency tradeoff curve. Final score refers to the average score on the GLUE test set.

1 Introduction

Diverse pre-trained language models (PLMs) (e.g., BERT (Devlin et al., 2019)) have been intensively investigated by designing new pretext tasks, architectures, or attention mechanisms (Yang et al., 2019; Jiang et al., 2020; Beltagy et al., 2020). The performances of these PLMs far exceed those of traditional methods on a variety of natural language processing (NLP) tasks. Nevertheless, their shortcomings are still evident (including a considerable model size and low inference efficiency), limiting real-world application scenarios.

To alleviate the aforementioned limitations, many model compression methods have been proposed, including quantization, weight pruning, and knowledge distillation (KD) (Shen et al., 2020; Michel et al., 2019; Jiao et al., 2020). Among them, KD (Hinton et al., 2015), which transfers the knowledge from larger teacher models to smaller student models with minimal performance sacrifice, is most widely used due to its plug-and-play feasibility and its scalability in the rapid delivery of new models. Specifically, KD allows us to train our own BERT architecture significantly faster than training from scratch. Hence, we adopt KD in this paper. Besides, inspired by the impressive results of neural architecture search (NAS) in vision tasks (Howard et al., 2019; Li et al., 2020; Cai et al., 2020), adopting NAS to further boost the performance of PLMs or reduce their computational cost has attracted increasing attention (So et al., 2019; Wang et al., 2020a; Chen et al., 2020).

Although considerable progress has been made in the field of KD for PLMs, the compression of the feed-forward network (FFN) has been rarely studied. This contradicts the fact that the computational cost of FFN is 2~3 times larger than that of the multi-head attention (MHA). In addition, Dong et al. (2021) have proved that the multilayer perceptron (MLP) in FFN can prevent the undesirable rank collapse caused by self-attention and thus can improve BERT optimization. These observations motivate us to investigate the nonlinearity of FFN in BERT.

In this paper, we make the first attempt to compress and improve the barely-explored multilayer perceptron (MLP) in FFN and propose a novel coarse-to-fine NAS approach with warm-up KD to find the optimal MLP architectures, aiming to search for a universal small BERT model with competitive performance and strong transferability. Specifically, we design a rich and flexible search space to discover an excellent FFN with maximal nonlinearity and minimal computational cost. Our search space contains various mathematical operations, stack numbers, and expansion ratios of the intermediate hidden size. To search efficiently in this vast space, we progressively shrink the search space in three stages:

• Stage 1: Perform a coarse search over the entire search space, i.e., jointly searching the mathematical operations, stack numbers, and expansion ratios (Figure 2 (a)).
• Stage 2: Fix the stack numbers and expansion ratios, and perform a fine-grained search for the optimal mathematical operations (Figure 2 (b)).
• Stage 3: Fix the mathematical operations, and perform a fine-grained search for the optimal stack numbers and expansion ratios (Figure 2 (c)).

Even with this coarse-to-fine search strategy, pre-training each candidate model still needs a lot of time to converge during searching. To solve this problem, different from the conventional KD strategy (Jiao et al., 2020), we propose a warm-up KD strategy to transfer the knowledge quickly, where a pre-trained supernet is additionally introduced to perform a joint warm-up for all candidate models. Note that the warm-up strategy in the third stage is slightly different from that of the first two stages. During the first two stages, each candidate model initially inherits its weights from a frozen warmed-up supernet to accelerate searching. In the third stage, since there is no need to search the mathematical operations, an unfrozen warmed-up supernet that shares weights across different candidate models is allowed, i.e., each model can inherit weights from this activated warmed-up supernet for a quick launch and is then trained with weight sharing in a multi-task manner to enhance transferability.

Extensive experimental results show that our searched architecture, named EfficientBERT, is 6.9× smaller and 4.4× faster than BERTBASE and has competitive performance. On the test set of the GLUE benchmark, EfficientBERT attains an average score of 77.7, which is 0.7 higher than MobileBERTTINY, and achieves an F1 score of 85.3/74.5 on the SQuAD v1.1/v2.0 dev sets, which is 3.2/2.7 higher than TinyBERT4 even without data augmentation.

2 Related Work

Compression for Pre-trained Language Models. For the past few years, pre-trained language models (PLMs) have demonstrated their strong power on a variety of NLP tasks, with a trend of ever larger model sizes alongside better results. However, it is hard to deploy them on resource-limited edge devices for practical usage. To solve this problem, many efficient PLMs have been proposed (Turc et al., 2019; Lan et al., 2020). For example, Turc et al. (2019) directly pre-train and fine-tune smaller BERT models. In addition, many compression techniques for PLMs have been proposed recently to reduce the training cost, including quantization, weight pruning, and knowledge distillation (KD) (Shen et al., 2020; Sajjad et al., 2020; Jiao et al., 2020). Among them, KD (Hinton et al., 2015) is widely used due to its plug-and-play feasibility; it aims to transfer the knowledge from larger teacher models to smaller student models without sacrificing too much performance. For example, BERT-PKD (Sun et al., 2019) jointly distills the intermediate and last layers during fine-tuning. DistilBERT (Sanh et al., 2020) distills the last layers with a triple loss during pre-training. MobileBERT (Sun et al., 2020) designs an inverted-bottleneck model structure and progressively transfers the knowledge during pre-training. MiniLM (Wang et al., 2020b) performs a deep self-attention distillation during pre-training. TinyBERT (Jiao et al., 2020) introduces a comprehensive Transformer distillation method during both pre-training and fine-tuning.
Figure 2: An overview of the search procedure of our EfficientBERT. The teacher model is BERTBASE (left), and the search space of our student model is designed towards achieving better nonlinearity of FFN, which contains mathematical operations, stack numbers, and intermediate expansion ratios (right). During searching, we progressively shrink the search space and divide the search process into three stages to conduct NAS ((a)-(c)). In both the search and retraining stages, we use a novel warm-up knowledge distillation to transfer the teacher model's knowledge (middle). Specifically, each candidate or retrained subnet first inherits the weights from a frozen or activated warmed-up supernet, then conducts pre-training and fine-tuning with a single-task or multi-task dataset.

Nevertheless, the compression of the feed-forward network (FFN) has not been well studied, although its computational cost is 2~3 times larger than that of the multi-head attention (MHA), as pointed out by Iandola et al. (2020). In contrast, compressing FFN is our main focus in this work.

Neural Architecture Search. Motivated by the success of neural architecture search (NAS) in computer vision (Howard et al., 2019; Li et al., 2020; Cai et al., 2020), increasing attention has been paid to applying NAS to NLP tasks (So et al., 2019; Wang et al., 2020a; Chen et al., 2020), aiming to automatically search for optimal architectures from a vast search space. Evolved Transformer (So et al., 2019) employs NAS to search for a better Transformer architecture with an evolutionary algorithm. HAT (Wang et al., 2020a) applies NAS to search for efficient hardware-aware Transformer models based on a Transformer supernet. AdaBERT (Chen et al., 2020) searches for task-adaptive small models with KD and a differentiable NAS method. NAS-BERT (Xu et al., 2021) proposes a task-agnostic NAS method for adaptive-size model compression, where several acceleration techniques (including block-wise search, search space pruning, and performance approximation) are introduced to speed up the searching process.

Differently, in this paper, we design a comprehensive search space towards the nonlinearity of the multilayer perceptron (MLP) in FFN, and propose a novel coarse-to-fine NAS approach with warm-up KD to find the optimal MLP architectures. Unlike AdaBERT, we apply NAS to search for a small universal BERT with competitive performance and strong transferability. And unlike NAS-BERT, we design a much more flexible search space and use warm-up KD with a coarse-to-fine searching paradigm to accelerate searching and enhance model transferability.

Table 1: Mathematical operations of FFN. Following the original papers of GeLU (Hendrycks and Gimpel, 2016) and Leaky ReLU (He et al., 2015), we let c1 = 0.5, c2 = sqrt(2/π), c3 = 0.044715, c4 = 0.01.

Operation   | Expression                        | Arity
Add         | x + y                             | 2
Mul         | x × y                             | 2
Max         | max(x, y)                         | 2
GeLU        | c1·x·(1 + tanh(c2·(x + c3·x^3)))  | 1
Sigmoid     | 1 / (1 + e^(-x))                  | 1
Tanh        | (e^x - e^(-x)) / (e^x + e^(-x))   | 1
ReLU        | max(x, 0)                         | 1
Leaky ReLU  | x if x ≥ 0 else c4·x              | 1
ELU         | x if x ≥ 0 else e^x - 1           | 1
Swish       | x / (1 + e^(-x))                  | 1

3 Our EfficientBERT

We aim at discovering a lightweight MLP architecture with better nonlinearity in each FFN layer, ensuring that the searched model can achieve compelling performance. We first present the search space towards better nonlinearity of MLP in FFN, as described in Section 3.1. Then we propose a novel coarse-to-fine NAS method with warm-up KD, as discussed in Section 3.2.
3.1 Search Space Design

In a standard Transformer layer, there are two main components: a multi-head attention (MHA) and a feed-forward network (FFN). Theoretically, the computation (Mult-Adds) of MHA and FFN is O(4Ld^2 + L^2 d) and O(2 × 4Ld^2) respectively, where L is the sequence length and d is the channel number. As d gets larger, the computation of FFN exceeds that of MHA. And as pointed out by previous works (Iandola et al., 2020), the latency of MHA and FFN in each layer of BERTBASE accounts for about 30% and 70%, respectively, on a Google Pixel 3 smartphone, and the parameter numbers for MHA and FFN are about 2.4M and 4.7M, respectively. These demonstrate the potential of compressing FFN, i.e., compressing FFN may be more promising than squeezing MHA. In addition, as discussed by Dong et al. (2021), the MLP in FFN can prevent an optimization problem, i.e., rank collapse, caused by self-attention; thus, the nonlinearity of FFN deserves to be investigated. Hence, our main focus is the compression and improvement of FFN.
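As a quick sanity check of the Mult-Adds argument above, the two formulas can be evaluated directly. The numbers below are illustrative only (BERT-BASE hidden size and the 128-token sequence length used later in this paper), not measurements from the paper.

```python
# Illustrative comparison of the per-layer Mult-Adds formulas quoted above.
L = 128   # sequence length (the maximum length used in Section 4.3)
d = 768   # hidden size of BERT-BASE

mha = 4 * L * d**2 + L**2 * d   # O(4Ld^2 + L^2 d): Q/K/V/output projections + attention
ffn = 2 * 4 * L * d**2          # O(2 * 4Ld^2): two linear maps with expansion ratio 4

print(f"MHA: {mha / 1e6:.0f}M, FFN: {ffn / 1e6:.0f}M, ratio: {ffn / mha:.2f}")
# ratio ~1.9 here and approaches 2 as d grows relative to L; the paper's 2~3x figure
# also reflects measured latency and parameter counts, not only this estimate.
```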
We then design a search space towards the nonlinearity of MLP in FFN, so as to search for a model with better FFN nonlinearity and increased performance. Many factors determine the FFN nonlinearity, such as the mathematical operations and the expansion ratios of the intermediate hidden size. Inspired by MobileBERT (Sun et al., 2020), we also find that increasing the stack number of FFN can remarkably improve the model performance. We integrate all of the above factors into our search space, including the mathematical operations, stack numbers, and intermediate expansion ratios of FFN. (1) Mathematical operation: We define some primitive operations (including several binary aggregation functions and unary activation functions) and search their different combinations, as shown in Table 1. (2) Stack number: The stack number of FFN is selected from {1, 2, 3, 4}. (3) Intermediate expansion ratio: The intermediate expansion ratio is selected from {1, 1/2, 1/3, 1/4}. Note that the stack number and the intermediate expansion ratio are jointly considered to balance the computational cost, e.g., the number of network parameters. We use a directed acyclic graph (DAG) to represent each FFN architecture when searching the mathematical operations. The mathematical operations and linear operations are optionally placed in the intermediate nodes to process the hidden states. More details of our search space can be found in Figure 2.
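To make the three searchable factors concrete, the sketch below shows one possible shape of such an FFN block in PyTorch. The class name, the residual/normalization placement, and the hard-coded GeLU standing in for a searched operation DAG are our own assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedFFN(nn.Module):
    """Sketch of one searchable FFN block: `stack_num` up/down linear pairs with a
    reduced intermediate size d * expansion_ratio; the element-wise formula between
    them (here simply GeLU) is what the operation search in Table 1 replaces."""

    def __init__(self, d=540, expansion_ratio=0.5, stack_num=2):
        super().__init__()
        d_int = max(1, int(d * expansion_ratio))
        self.stack = nn.ModuleList(
            [nn.ModuleDict({"up": nn.Linear(d, d_int), "down": nn.Linear(d_int, d)})
             for _ in range(stack_num)]
        )
        self.norm = nn.LayerNorm(d)

    def forward(self, x):
        residual = x
        for layer in self.stack:
            x = layer["down"](F.gelu(layer["up"](x)))  # placeholder for the searched DAG
        return self.norm(x + residual)

out = StackedFFN()(torch.randn(2, 128, 540))  # (batch, seq_len, hidden)
print(out.shape)                              # torch.Size([2, 128, 540])
```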
3.2 Neural Architecture Search

Base Model Design. As discussed in previous BERT compression works (Jiao et al., 2020; Sun et al., 2020), there are several strategies to reduce the model size, including embedding factorization and model width/depth reduction. However, most recent works only consider part of these strategies. In our work, we design a base model with all of these strategies to make the compression comprehensive. Besides, we find that the expansion ratio of the intermediate hidden size in FFN contributes a lot to the model size and inference latency. Thus the reduction of the intermediate expansion ratio is also considered. The detailed settings of our base model can be found in Section 4.2.

Coarse-to-Fine NAS with Warm-up KD. To speed up the search in the vast search space, we propose a coarse-to-fine NAS method that progressively shrinks the search space. The search process is divided into three stages: a coarse-grained search is conducted in the first stage to jointly search all of the factors in our search space, and fine-grained searches are conducted in the last two stages to search for partial factors.

In the first search stage, we jointly search all of the factors in our search space, including the mathematical operations, stack numbers, and intermediate expansion ratios. We use the DAG computation graph described in Section 3.1 to represent each MLP architecture. The initial search candidates are based on our base model, but different stack numbers and intermediate expansion ratios of FFN are allowed. During searching, each candidate model is first sampled by a learnable sampling decision tree as proposed in LaNAS (Wang et al., 2019). Then warm-up KD is employed on each candidate model to accelerate the search process. Since we need to search for the mathematical operations, we cannot share the weights of different candidate models, to avoid potential interference problems. Instead, we first build a warmed-up supernet based on our base model with the maximum FFN stack number and intermediate expansion ratio in our search space. The supernet is pre-trained entirely (i.e., as a complete graph) with KD, and its weights are then frozen. When training each candidate model, we first inherit its weights from the supernet.
Precisely, the weights of each stacked FFN are sliced from the bottom to the top layer, and the weights of each linear operation are sliced from the left-most to the right-most channel. After that, we only need to pre-train and fine-tune each model for a few steps via KD to adjust the inherited weights. This significantly reduces the search cost.
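A minimal sketch of this slicing-based inheritance is given below; `inherit_linear` is a hypothetical helper, and the shapes assume the supernet was built with the maximum expansion ratio while the candidate searched a ratio of 1/2.

```python
import torch

def inherit_linear(w_sup: torch.Tensor, b_sup: torch.Tensor, out_dim: int, in_dim: int):
    """Hypothetical helper: a candidate linear layer of shape (out_dim, in_dim) takes
    the left-most/top-most channels of the corresponding (frozen) supernet layer."""
    return w_sup[:out_dim, :in_dim].clone(), b_sup[:out_dim].clone()

# Supernet intermediate projection at the maximum width (540 -> 540); the candidate
# uses expansion ratio 1/2, i.e., 540 -> 270.
w_sup, b_sup = torch.randn(540, 540), torch.randn(540)
w_cand, b_cand = inherit_linear(w_sup, b_sup, out_dim=270, in_dim=540)
print(w_cand.shape, b_cand.shape)  # torch.Size([270, 540]) torch.Size([270])

# Stacked FFNs are inherited analogously from bottom to top: a candidate with stack
# number 2 would copy the first two of the supernet's four (maximum) stacks.
```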
In the second search stage, to discover more diversified mathematical operations and evaluate their effects, we search them individually with the same method as in the first search stage. The initial search candidates are built upon the searched model of the first search stage (i.e., we fix the stack numbers and expansion ratios). The sampling and KD strategies are the same as in the first search stage.
In the third search stage, we jointly search the stack numbers and intermediate expansion ratios in the search space to further explore their potential. The initial search candidates are based on the second search stage's searched model; the searched mathematical operations are fixed, but different stack numbers and intermediate expansion ratios of FFN are allowed. We also apply warm-up KD to accelerate the searching. Specifically, we first warm up the supernet entirely (i.e., as a complete graph) via KD again but do not freeze its weights. Then we share the weights of different candidate models (i.e., subgraphs of the supernet) during pre-training and fine-tuning for acceleration. Each candidate model is sampled uniformly. Compared with the first two search stages, the search cost is dramatically reduced, enabling us to leverage more downstream datasets to enhance the model transferability. Inspired by MT-DNN (Liu et al., 2019), each candidate model is fine-tuned in a multi-task manner on different categories of downstream tasks. The weights of the embedding and Transformer layers are shared across all tasks, while those of the prediction layers are task-specific.
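The shared-encoder / per-task-head setup can be sketched as follows; the class and attribute names are hypothetical, and the encoder stands for whatever subnet of the supernet is currently being evaluated.

```python
import torch.nn as nn

class MultiTaskStudent(nn.Module):
    """Sketch of the MT-DNN-style multi-task fine-tuning used in the third search
    stage: one shared embedding/Transformer encoder, one lightweight classification
    head per GLUE task (names here are illustrative, not the released code)."""

    def __init__(self, encoder: nn.Module, hidden_size: int, task_num_labels: dict):
        super().__init__()
        self.encoder = encoder  # shared across tasks (a subnet of the supernet)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in task_num_labels.items()}
        )

    def forward(self, input_ids, attention_mask, task: str):
        hidden = self.encoder(input_ids, attention_mask)  # (batch, seq_len, hidden)
        return self.heads[task](hidden[:, 0])             # predict from the first token

# e.g. MultiTaskStudent(encoder, 540, {"mnli": 3, "rte": 2, "sst2": 2})
```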
Warm-up KD Formulations. In our warm-up KD, each candidate/retrained model initially inherits the weights from a warmed-up supernet. We use BERTBASE (Devlin et al., 2019) as the teacher model. Following TinyBERT (Jiao et al., 2020), we jointly distill the attention matrices, Transformer-layer outputs, embeddings, and predicted logits between the student and teacher models. In detail, the attention loss at the m-th student layer, $\mathcal{L}^m_{attn}$, is calculated with the mean square error (MSE) loss as:

$$\mathcal{L}^m_{attn} = \frac{1}{h}\sum_{i=1}^{h} \mathrm{MSE}(A^S_{i,m}, A^T_{i,n}), \quad (1)$$

where $A^S_{i,m}$ and $A^T_{i,n}$ refer to the i-th head of the attention matrices at the m-th student layer and its matching n-th teacher layer, respectively, and $h$ is the number of attention heads. The Transformer-layer output loss at the m-th student layer, $\mathcal{L}^m_{hidn}$, and the embedding loss $\mathcal{L}_{embd}$ can be formulated as:

$$\mathcal{L}^m_{hidn} = \mathrm{MSE}(H^m_S W_h, H^n_T), \qquad \mathcal{L}_{embd} = \mathrm{MSE}(E_S W_e, E_T), \quad (2)$$

where $H^m_S$ and $H^n_T$ are the Transformer-layer outputs at the m-th student layer and its matching n-th teacher layer, respectively, $E$ is the embedding, and two learnable transformation matrices $W_h$ and $W_e$ are applied to align the mismatched dimensions between the student and teacher models. Moreover, the prediction loss $\mathcal{L}_{pred}$, calculated with the soft cross-entropy (CE) loss, can be formulated as:

$$\mathcal{L}_{pred} = \mathrm{CE}(z_S/t, z_T/t), \quad (3)$$

where $z$ is the predicted logits vector and $t$ is the temperature value. Finally, we combine all of the above losses and derive the overall KD loss as:

$$\mathcal{L} = \sum_{m=1}^{M} (\mathcal{L}^m_{attn} + \mathcal{L}^m_{hidn}) + \mathcal{L}_{embd} + \gamma\,\mathcal{L}_{pred}, \quad (4)$$

where $M$ is the number of Transformer layers in the student model and $\gamma$ controls the weight of the prediction loss $\mathcal{L}_{pred}$.
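A compact sketch of Eqs. (1)-(4) in PyTorch is shown below. The tensor layouts and the student-to-teacher layer matching are assumptions on our part; `mse_loss` over the full attention tensor averages over heads, which matches the head-averaged MSE of Eq. (1) up to the per-element reduction.

```python
import torch.nn.functional as F

def warmup_kd_loss(attn_S, attn_T, hidn_S, hidn_T, W_h,
                   emb_S, emb_T, W_e, logits_S, logits_T, gamma=1.0, t=1.0):
    """Sketch of the overall KD objective, Eqs. (1)-(4). attn_*/hidn_* are lists of
    already-matched per-layer tensors: attention maps (batch, heads, L, L) and layer
    outputs (batch, L, d_S) / (batch, L, d_T)."""
    loss = 0.0
    for A_S, A_T, H_S, H_T in zip(attn_S, attn_T, hidn_S, hidn_T):
        loss = loss + F.mse_loss(A_S, A_T)        # Eq. (1): attention-matrix loss
        loss = loss + F.mse_loss(H_S @ W_h, H_T)  # Eq. (2): hidden-state loss
    loss = loss + F.mse_loss(emb_S @ W_e, emb_T)  # Eq. (2): embedding loss
    soft_ce = -(F.softmax(logits_T / t, dim=-1)
                * F.log_softmax(logits_S / t, dim=-1)).sum(-1).mean()  # Eq. (3)
    return loss + gamma * soft_ce                 # Eq. (4)
```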
4 Experiment

This section demonstrates the superior performance and transferability of our EfficientBERT on a wide range of downstream tasks.

4.1 Datasets

We evaluate our model on two standard benchmarks for natural language understanding, i.e., the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) and the Stanford Question Answering Dataset (SQuAD).
Table 2: Results on the test set of the GLUE benchmark. The architectures of the different models are as follows. BERTTINY & TinyBERT4: (M=4, d=312, di=1200); BERTSMALL: (M=4, d=512, di=2048); BERT-PKD4 & DistilBERT4: (M=4, d=768, di=3072); BERT-PKD6 & DistilBERT6: (M=6, d=768, di=3072). The latency is the average inference time over 100 runs on a single GPU with a batch size of 128.

Model                              | #Params | Latency | MNLI-m/mm | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Avg
BERTBASE (Google)                  | 108.9M  | 362ms   | 84.6/83.4 | 71.2 | 90.5 | 93.5  | 52.1 | 85.8  | 88.9 | 66.4 | 79.6
BERTBASE (Teacher)                 | 108.9M  | 362ms   | 84.8/83.8 | 71.6 | 91.3 | 93.1  | 53.9 | 85.3  | 89.2 | 68.9 | 80.2
BERTTINY (Turc et al., 2019)       | 14.5M   | 43ms    | 75.4/74.9 | 66.5 | 84.8 | 87.6  | 19.5 | 77.1  | 83.2 | 62.6 | 70.2
BERTSMALL (Turc et al., 2019)      | 28.8M   | 75ms    | 77.6/77.0 | 68.1 | 86.4 | 89.7  | 27.8 | 77.0  | 83.4 | 61.8 | 72.1
BERT-PKD4 (Sun et al., 2019)       | 52.8M   | 129ms   | 79.9/79.3 | 70.2 | 85.1 | 89.4  | 24.8 | 79.8  | 82.6 | 62.3 | 72.6
BERT-PKD6 (Sun et al., 2019)       | 67.0M   | 193ms   | 81.5/81.0 | 70.7 | 89.0 | 92.0  | 43.5 | 81.6  | 85.0 | 65.5 | 76.6
DistilBERT4 (Sanh et al., 2020)    | 52.8M   | 129ms   | 78.9/78.0 | 68.5 | 85.2 | 91.4  | 32.8 | 76.1  | 82.4 | 54.1 | 71.9
DistilBERT6 (Sanh et al., 2020)    | 67.0M   | 193ms   | 82.6/81.3 | 70.1 | 88.9 | 92.5  | 49.0 | 81.3  | 86.9 | 58.4 | 76.8
TinyBERT4 (Jiao et al., 2020)      | 14.5M   | 43ms    | 81.8/80.7 | 69.6 | 87.7 | 91.2  | 27.2 | 83.0  | 88.5 | 64.9 | 75.0
MobileBERTTINY (Sun et al., 2020)  | 15.1M   | 96ms    | 81.5/81.6 | 68.9 | 89.5 | 91.7  | 46.7 | 80.1  | 87.9 | 65.1 | 77.0
EfficientBERTTINY                  | 9.4M    | 52ms    | 82.4/81.0 | 70.3 | 88.5 | 91.2  | 37.5 | 80.9  | 87.8 | 64.6 | 76.0
EfficientBERT w/o Warm-up KD       | 15.7M   | 83ms    | 83.1/82.0 | 71.0 | 89.5 | 90.8  | 42.1 | 82.1  | 88.4 | 67.2 | 77.4
EfficientBERT                      | 15.7M   | 83ms    | 83.3/82.3 | 71.0 | 90.2 | 92.1  | 43.8 | 82.9  | 88.2 | 65.7 | 77.7
EfficientBERT+                     | 15.7M   | 83ms    | 83.0/82.3 | 71.2 | 89.3 | 92.4  | 38.1 | 85.1  | 89.9 | 69.4 | 77.9
EfficientBERT++                    | 16.0M   | 103ms   | 83.0/82.5 | 71.2 | 90.6 | 92.3  | 42.5 | 83.6  | 88.9 | 67.8 | 78.0
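For reference, latency numbers of the kind reported in Table 2 (average over 100 forward passes with a batch of 128) can be measured roughly as below; the stand-in model and the measurement details are our assumptions, not the authors' benchmarking script.

```python
import time
import torch

# Rough GPU latency measurement in the spirit of Table 2's protocol: average
# inference time over 100 runs with a batch of 128 sequences of length 128.
model = torch.nn.TransformerEncoderLayer(d_model=540, nhead=4, batch_first=True)  # stand-in
model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
x = torch.randn(128, 128, 540, device=device)  # (batch, seq_len, hidden)

with torch.no_grad():
    for _ in range(10):                 # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
print(f"avg latency: {(time.time() - start) / 100 * 1000:.1f} ms")
```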

Table 3: Results on the dev set of the GLUE benchmark compared with other NAS methods. † indicates results with data augmentation.

Model                         | #Params  | MNLI-m | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Avg
AdaBERT (Chen et al., 2020)†  | 6.4~9.5M | 81.3   | 70.5 | 87.2 | 91.9  | -    | -     | 84.7 | 64.1 | -
NAS-BERT10 (Xu et al., 2021)  | 10M      | 76.4   | 88.5 | 86.3 | 88.6  | 34.0 | 84.8  | 79.1 | 66.6 | 75.5
NAS-BERT30 (Xu et al., 2021)  | 30M      | 81.0   | 90.2 | 88.4 | 90.5  | 48.7 | 87.6  | 84.6 | 71.8 | 80.3
EfficientBERTTINY             | 9.4M     | 81.7   | 86.7 | 89.3 | 90.1  | 39.1 | 79.9  | 90.1 | 63.2 | 77.5
EfficientBERT                 | 15.7M    | 83.1   | 87.3 | 90.4 | 91.3  | 50.2 | 82.5  | 91.5 | 66.8 | 80.4

Table 4: Results on the SQuAD dev datasets. The architectures of MiniLM4 and MiniLM6 are (M=4, d=384, di=1536) and (M=6, d=384, di=1536), respectively. † indicates results with data augmentation.

Model                           | #Params | SQuAD v1.1 EM/F1 | SQuAD v2.0 EM/F1
BERTBASE (Google)               | 108.9M  | 80.8/88.5        | -/-
BERTBASE (Teacher)              | 108.9M  | 80.5/88.2        | 74.8/77.7
BERT-PKD4 (Sun et al., 2019)    | 52.8M   | 70.1/79.5        | 60.8/64.6
BERT-PKD6 (Sun et al., 2019)    | 67.0M   | 77.1/85.3        | 66.3/69.8
DistilBERT4 (Sanh et al., 2020) | 52.8M   | 71.8/81.2        | 60.6/64.1
DistilBERT6 (Sanh et al., 2020) | 67.0M   | 78.1/86.2        | 66.0/69.5
TinyBERT4 (Jiao et al., 2020)†  | 14.5M   | 72.7/82.1        | 68.2/71.8
MiniLM4 (Wang et al., 2020b)    | 19.3M   | -/-              | -/69.7
MiniLM6 (Wang et al., 2020b)    | 22.9M   | -/-              | -/72.7
EfficientBERTTINY               | 9.4M    | 74.8/83.6        | 68.6/71.9
EfficientBERT                   | 15.7M   | 77.0/85.3        | 71.4/74.5
EfficientBERT++                 | 16.0M   | 78.3/86.5        | 73.0/76.1

The GLUE benchmark contains nine classification datasets, including MNLI (Williams et al., 2018), QQP (Chen et al., 2018), QNLI (Rajpurkar et al., 2016), SST-2 (Socher et al., 2013), CoLA (Warstadt et al., 2019), STS-B (Cer et al., 2017), MRPC (Dolan and Brockett, 2005), RTE (Bentivogli et al., 2009), and WNLI (Levesque et al., 2011). The SQuAD task aims to predict the answer text span for a given question in a Wikipedia passage, and contains two datasets: SQuAD v1.1 (Rajpurkar et al., 2016) and SQuAD v2.0 (Rajpurkar et al., 2018). The evaluation metrics can be found in Wang et al. (2018) and Rajpurkar et al. (2016).

4.2 Model Settings

The embedding factorization strategy of our base model is the same as that of MobileBERT (Sun et al., 2020); the number of Transformer layers M is set to 6, the hidden size d is set to 540, and the intermediate expansion ratio of FFN is set to 1 with an intermediate hidden size di of 540. The remaining structures are the same as BERTBASE. We retrain our searched model of the third search stage by employing the warm-up KD method used in the first two search stages described in Section 3.2, and refer to it as EfficientBERT. EfficientBERT+ is obtained by inheriting the weights of EfficientBERT from the multi-task fine-tuned supernet and then directly fine-tuning on each downstream task. Moreover, to verify the importance of model depth, we extend our EfficientBERT from 6 layers to 12 layers by affinely repeating each layer in EfficientBERT twice and shrinking the hidden size from 540 to 360, forming EfficientBERT++. Its weights are initially inherited from the warmed-up supernet of EfficientBERT in the same manner. In addition, to ensure a fair comparison with TinyBERT4, we further shrink the hidden size of our EfficientBERT from 540 to 360, forming EfficientBERTTINY, which has a similar latency to TinyBERT4. (Our searched model, i.e., EfficientBERT, can be seen in Figure 6 of Appendix A.)
Table 5: Results of the searched models at different search stages on the GLUE test set. Wiki and Books refer to the pre-training corpora of English Wikipedia and BooksCorpus, respectively.

Model (Pre-train Dataset)   | #Params | MNLI-m/mm | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Avg
Base Model (Wiki)           | 15.3M   | 82.5/81.6 | 71.0 | 89.0 | 91.4  | 37.3 | 82.1  | 86.1 | 65.8 | 76.3
Search Stage 1 (Wiki)       | 15.4M   | 82.8/82.0 | 71.0 | 89.7 | 91.8  | 37.4 | 82.2  | 87.7 | 65.3 | 76.7
Search Stage 2 (Wiki)       | 15.4M   | 82.8/82.3 | 70.9 | 89.8 | 92.2  | 38.3 | 82.1  | 88.5 | 65.7 | 77.0
Search Stage 2 (Wiki+Books) | 15.4M   | 82.8/82.0 | 71.1 | 89.7 | 92.1  | 42.5 | 82.2  | 88.2 | 66.3 | 77.4
EfficientBERT (Wiki+Books)  | 15.7M   | 83.3/82.3 | 71.0 | 90.2 | 92.1  | 43.8 | 82.9  | 88.2 | 65.7 | 77.7

Table 6: Effectiveness comparison between single-stage searching and our coarse-to-fine NAS method.

Method              | Best Score | #Searched Arch | Search Cost
Single stage        | 76.7       | 2,700          | 84 GPU days
Search stage 1      | 76.7       | 1,900          | 54 GPU days
Search stage 1, 2   | 77.0       | 2,000          | 56 GPU days
Coarse-to-Fine NAS  | 77.7       | 5,000          | 58 GPU days

4.3 Implementation Details

In the first two search stages, the frozen supernet sliced by each candidate model is pre-trained for ten epochs, and we use 2% of the English Wikipedia corpus to pre-train each candidate model for one epoch. During fine-tuning, we use the first 10% of the MNLI training set to train each model for three epochs and the last 1% of the training set for evaluation. In the third search stage, the activated supernet is pre-trained and fine-tuned for ten epochs, and each candidate model is optimized for one step. We use the entire corpora of English Wikipedia and BooksCorpus as the pre-training data, the combination of 90% of the training set of each downstream GLUE task as the fine-tuning data, and the remaining 10% of the MNLI training set as the evaluation data. The batch size at each search stage is set to 256. The learning rates for pre-training and fine-tuning at each stage are set to 1e-4 and 4e-4, respectively.

During retraining, each searched model is first pre-trained for ten epochs based on the weights inherited from the warmed-up supernet and is then fine-tuned on downstream tasks for ten epochs, except for CoLA. Note that CoLA is fine-tuned for 50 epochs following the widely-used protocol. The batch sizes for pre-training and fine-tuning are set to 256 and 32, respectively. The learning rate for pre-training is set to 1e-4. The learning rates for fine-tuning on the GLUE and SQuAD datasets are set to 5e-5 and 1e-4, respectively.

In all of our experiments, γ is set to 0 for pre-training and 1 for fine-tuning, and t is set to 1. The maximum sequence length is set to 128. We use Adam with β1 = 0.9, β2 = 0.999, an L2 weight decay of 0.01, a warm-up proportion of 0.1, and linear decay of the learning rate.
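The optimization setup above can be written down as follows; the stand-in model, the step count, and the use of the decoupled AdamW variant (rather than plain Adam with L2 regularization) are our assumptions for illustration.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(540, 2)   # stand-in for the student model
total_steps = 10_000              # placeholder; depends on dataset size, batch size, epochs

# beta1=0.9, beta2=0.999, weight decay 0.01, learning rate 1e-4 (pre-training setting).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)

# 10% of the steps for linear warm-up, then linear decay of the learning rate.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)
```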
4.4 Results on GLUE

We compare our searched models with BERTTINY and BERTSMALL (Turc et al., 2019) and several state-of-the-art compressed BERT models, including BERT-PKD (Sun et al., 2019), DistilBERT (Sanh et al., 2020), TinyBERT4 (Jiao et al., 2020), and MobileBERTTINY (Sun et al., 2020). For a fair comparison, TinyBERT4 is re-implemented by removing the data augmentation and fine-tuning from the official general distillation weights (we use the 2nd version from https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT). The experimental results on the test set of the GLUE benchmark are listed in Table 2 and Figure 1.

From the results in Table 2, we can observe that: (1) Our EfficientBERT is 6.9× smaller and 4.4× faster than BERTBASE and achieves a competitive average GLUE score of 77.7, which is 0.7 higher than its counterpart MobileBERTTINY. (2) Our EfficientBERT+ has better transferability than EfficientBERT across different GLUE tasks, with an improvement of 0.2 on the average score, demonstrating the effectiveness of our multi-task training strategy in the third search stage. (3) Our EfficientBERT++ achieves state-of-the-art performance, outperforming MobileBERTTINY by 1.0 on the average score. (4) Our EfficientBERTTINY outperforms TinyBERT4 by a 1.0 average score with fewer parameters and similar latency. (5) Without our warm-up KD during retraining, i.e., pre-training the model from scratch rather than from the warmed-up supernet, the average score of EfficientBERT decreases by 0.3, demonstrating the advantage of retraining with our warm-up KD. And from the results in Figure 1, we can see that all of our searched models outperform the other compared models at similar or lower latency.
Table 7: Results of our EfficientBERT with different base models on the GLUE test set.

Model (Base Model)         | #Params | MNLI-m/mm | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Avg
TinyBERT6                  | 67.0M   | 83.8/83.2 | 71.4 | 89.8 | 92.0  | 38.8 | 83.1  | 89.0 | 65.8 | 77.4
EfficientBERT (TinyBERT6)  | 70.1M   | 84.1/83.2 | 71.4 | 90.4 | 92.6  | 46.2 | 83.7  | 89.0 | 67.7 | 78.7

Figure 3: Results of the model ranking correlation between the search and retraining phases on the GLUE dev set in the first search stage.

Figure 4: Efficiency comparison results between searching with and without our warm-up KD.

Furthermore, to verify the effectiveness of our proposed NAS method, we compare with several related NAS methods on the GLUE dev set, including AdaBERT (Chen et al., 2020) and NAS-BERT (Xu et al., 2021). The results are shown in Table 3. As can be seen, with similar parameter counts, our EfficientBERTTINY has better performance than AdaBERT and NAS-BERT10, and our EfficientBERT outperforms NAS-BERT30 even with much fewer parameters. These results demonstrate the superiority of our NAS method.

4.5 Results on SQuAD

To measure the transferability of our searched models across different types of tasks, we further evaluate our models on the SQuAD dev datasets, as shown in Table 4. We choose BERT-PKD, DistilBERT, TinyBERT4, and MiniLM (Wang et al., 2020b) as the baseline models. From the results, we can see that our EfficientBERT still achieves competitive performance: it outperforms TinyBERT4 by a 3.2/2.7 F1 score on the SQuAD v1.1/v2.0 dev sets even without data augmentation, and surpasses MiniLM6 by a 1.8 F1 score on the SQuAD v2.0 dev set. Besides, our EfficientBERTTINY also outperforms TinyBERT4 on both SQuAD dev datasets. These results indicate the strong performance and transferability of our searched models.

4.6 Discussion

Effectiveness of the Coarse-to-Fine NAS Method. To measure the effectiveness of our coarse-to-fine NAS method, we first compare the performances of the searched models at different search stages on the GLUE test set in Table 5. It can be observed that the searched model of the first search stage has better performance than our base model, which proves the effectiveness of the coarse-grained NAS process. And from the first to the third search stage, the performances of the searched models are gradually enhanced, which shows the effectiveness of the fine-grained strategies and the necessity of each factor in our search space.

Then we compare the effectiveness of single-stage searching and our coarse-to-fine NAS method in Table 6. As shown, our coarse-to-fine NAS method is more efficient than single-stage searching, saving 26 GPU days. It also searches 2,300 more architectures and finds a better architecture with a higher GLUE test score.

Effectiveness of Warm-up KD. To evaluate the model ranking effectiveness of our warm-up KD method between the search and retraining phases, we first randomly sample eight candidate models in the first search stage, whose search scores range from 77.6 to 79.3. Then we retrain each model and obtain its final score on the GLUE dev set, as shown in Figure 3. The Kendall Tau τ (Kendall, 1938) for each downstream task is also calculated. From the results, we can see that the search and retraining phases have strong positive correlations on most downstream tasks, demonstrating the strong ranking capability of the warm-up KD strategy.
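The Kendall Tau correlation between the two rankings can be computed with SciPy; the score pairs below are placeholders that only show the call, not the paper's measured values.

```python
from scipy.stats import kendalltau

# Placeholder search/final score pairs for eight sampled candidates (illustrative only).
search_scores = [77.6, 77.9, 78.1, 78.4, 78.6, 78.9, 79.1, 79.3]
final_scores  = [74.1, 74.5, 74.3, 75.0, 75.2, 75.6, 75.5, 76.0]

tau, p_value = kendalltau(search_scores, final_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```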
Next, to test the efficiency, we compare the fine-tuning losses of our base model in the first search stage when searching with and without our warm-up KD strategy, as shown in Figure 4. From the results, we can observe that the loss with our warm-up KD reaches a lower value in much fewer steps.
Transferability across Different Base Models. To test the transferability of our EfficientBERT across different base models, we replace the architecture of TinyBERT6 with that of EfficientBERT and evaluate it on the GLUE test set. For both models, we use English Wikipedia to pre-train for three epochs from scratch, to be consistent with Jiao et al. (2020). Note that the intermediate expansion ratio in our search space is applied to the original intermediate hidden size of TinyBERT6 (i.e., 3072). The results are shown in Table 7. From the results, we can observe that our EfficientBERT with the base model of TinyBERT6 outperforms the original TinyBERT6 on most of the downstream tasks and gains an improvement of 1.3 on the average GLUE score, showing the strong transferability of our EfficientBERT.

Figure 5: Visualization of the FFN nonlinearity of (a) BERTBASE, (b) our EfficientBERT, (c) our base model, and (d)-(f) randomly selected candidate models with worse performances in the first search stage.

Visualization of FFN Nonlinearity. In Figure 5, we visualize the FFN nonlinearity of BERTBASE, our EfficientBERT, our base model, and three randomly selected candidate models with worse performances in the first search stage. The input embedding of each model has two dimensions serving as axes X and Y, whose values are uniformly selected from -15~5 to approximate the distribution of the embedding in BERTBASE. The average output of the last Transformer layer is regarded as the value of axis Z. Besides, we remove the MHA, replace the layer normalization with a simple average operation, and set the weights and biases in each linear operation to 1 and 0, respectively, in order to alleviate their impacts. From the results, we can observe that the curves of (a)-(c) are smoother and have fewer sudden-increase regions than (d)-(f); and from (a) to (c), the curve complexity gradually decreases. This reflects that BERTBASE (our teacher model) has the best FFN nonlinearity, and that our EfficientBERT has better nonlinearity than the base model and the randomly selected candidate models. This verifies the superiority of our method in gaining better nonlinear mapping ability. More visualizations of the nonlinearity can be seen in Figures 7-8 of Appendix B.
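This visualization procedure can be sketched as follows for a single simplified FFN; the specific block shape, the grid resolution, and treating the searched formula as plain GeLU are illustrative assumptions rather than the authors' plotting code.

```python
import numpy as np

def simplified_ffn(x, y):
    """Toy stand-in for one simplified FFN with all linear weights set to 1 and
    biases to 0: the two input dimensions are summed (a width-1 'linear' map),
    passed through GeLU, and returned, mimicking how the Z-axis surface in
    Figure 5 isolates the nonlinearity of the element-wise formula."""
    h = x + y                                   # 2 -> 1 "linear" layer, weights = 1
    return 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GeLU

# Evaluate over the X/Y grid described above (values uniformly drawn from -15 to 5).
xs = np.linspace(-15, 5, 64)
X, Y = np.meshgrid(xs, xs)
Z = simplified_ffn(X, Y)
print(Z.shape, float(Z.min()), float(Z.max()))
```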
5 Conclusion

In this paper, we focus on the compression and improvement of FFN and design a thorough search space over the nonlinearity of the MLP in FFN, aiming to search for better MLP architectures that improve model performance. Due to the enormous search space, we conduct NAS in a progressive manner and employ a novel warm-up KD strategy at each search stage to accelerate searching and enhance model transferability. Extensive experiments show that our searched architecture, EfficientBERT, is 6.9× smaller and 4.4× faster than BERTBASE, and has competitive performance and strong generalization ability. In the future, we will leverage NAS to discover more dynamic PLMs with respect to different hardware and downstream tasks.

Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0109700, the National Natural Science Foundation of China (NSFC) under Grants No. U19A2073 and No. 61976233, the Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), and the Shenzhen Fundamental Research Program (Project No. RCYX20200714114642083, No. JCYJ20190807154211365).

References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Text Analysis Conference.
the MHA, replace the layer normalization with the
simple average operation, and set the weights and Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo
Giampiccolo, and Bernardo Magnini. 2009. The for mobilenetv3. In Proceedings of the IEEE/CVF
fifth pascal recognizing textual entailment challenge. International Conference on Computer Vision.
In Text Analysis Conference.
Forrest Iandola, Albert Shaw, Ravi Krishna, and Kurt
Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Keutzer. 2020. SqueezeBERT: What can computer
and Song Han. 2020. Once-for-all: Train one net- vision teach NLP about efficient neural networks?
work and specialize it for efficient deployment. In In Proceedings of SustaiNLP: Workshop on Simple
International Conference on Learning Representa- and Efficient Natural Language Processing, pages
tions. 124–135.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez- Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng
Gazpio, and Lucia Specia. 2017. SemEval-2017 Chen, Jiashi Feng, and Shuicheng Yan. 2020. Con-
task 1: Semantic textual similarity multilingual and vbert: Improving bert with span-based dynamic con-
crosslingual focused evaluation. In Proceedings of volution. arXiv preprint arXiv:2008.02496.
the 11th International Workshop on Semantic Evalu-
ation, pages 1–14. Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang,
Xiao Chen, Linlin Li, Fang Wang, and Qun Liu.
Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, 2020. TinyBERT: Distilling BERT for natural lan-
Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, guage understanding. In Findings of the Association
Wei Lin, and Jingren Zhou. 2020. Adabert: Task- for Computational Linguistics: EMNLP 2020, pages
adaptive bert compression with differentiable neural 4163–4174.
architecture search. In Proceedings of the Twenty-
Ninth International Joint Conference on Artificial In- Maurice G Kendall. 1938. A new measure of rank cor-
telligence, pages 2463–2469. relation. Biometrika, pages 81–93.
Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Quora question pairs. Quora. Kevin Gimpel, Piyush Sharma, and Radu Soricut.
2020. Albert: A lite bert for self-supervised learning
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and of language representations. In International Con-
Kristina Toutanova. 2019. BERT: Pre-training of ference on Learning Representations.
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of Hector J Levesque, Ernest Davis, and Leora Morgen-
the North American Chapter of the Association for stern. 2011. The winograd schema challenge. In
Computational Linguistics: Human Language Tech- AAAI Spring Symposium: Logical Formalizations of
nologies, pages 4171–4186. Commonsense Reasoning.
William B. Dolan and Chris Brockett. 2005. Automati- Changlin Li, Jiefeng Peng, Liuchun Yuan, Guangrun
cally constructing a corpus of sentential paraphrases. Wang, Xiaodan Liang, Liang Lin, and Xiaojun
In Proceedings of the Third International Workshop Chang. 2020. Block-wisely supervised neural ar-
on Paraphrasing. chitecture search with knowledge distillation. In
Yihe Dong, Jean-Baptiste Cordonnier, and Andreas IEEE/CVF Conference on Computer Vision and Pat-
Loukas. 2021. Attention is not all you need: Pure tern Recognition, pages 1986–1995.
attention loses rank doubly exponentially with depth.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian-
arXiv preprint arXiv:2103.03404.
feng Gao. 2019. Multi-task deep neural networks
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian for natural language understanding. In Proceedings
Sun. 2015. Delving deep into rectifiers: Surpass- of the 57th Annual Meeting of the Association for
ing human-level performance on imagenet classifi- Computational Linguistics, pages 4487–4496.
cation. In International Conference on Computer
Vision, pages 1026–1034. Paul Michel, Omer Levy, and Graham Neubig. 2019.
Are sixteen heads really better than one? In Ad-
Dan Hendrycks and Kevin Gimpel. 2016. Bridg- vances in Neural Information Processing Systems,
ing nonlinearities and stochastic regularizers with pages 14014–14024.
gaussian error linear units. arXiv preprint arXiv:
1606.08415. Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018.
Know what you don’t know: Unanswerable ques-
Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. tions for SQuAD. In Proceedings of the 56th An-
2015. Distilling the knowledge in a neural network. nual Meeting of the Association for Computational
In NIPS Deep Learning and Representation Learn- Linguistics, pages 784–789.
ing Workshop.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Andrew Howard, Mark Sandler, Grace Chu, Liang- Percy Liang. 2016. SQuAD: 100,000+ questions for
Chieh Chen, Bo Chen, Mingxing Tan, Weijun machine comprehension of text. In Proceedings of
Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, the 2016 Conference on Empirical Methods in Natu-
Quoc V. Le, and Hartwig Adam. 2019. Searching ral Language Processing, pages 2383–2392.
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. Poor man's BERT: Smaller and faster transformer models. arXiv preprint arXiv:2004.03844.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8815–8821.

David R. So, Quoc V. Le, and Chen Liang. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning, pages 5877–5886.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4323–4332.

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2158–2170.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020a. HAT: Hardware-aware transformers for efficient natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7675–7688.

Linnan Wang, Saining Xie, Teng Li, Rodrigo Fonseca, and Yuandong Tian. 2019. Sample-efficient neural architecture search by learning action space. arXiv preprint arXiv:1906.06832.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020b. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, pages 625–641.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1112–1122.

Jin Xu, Xu Tan, Renqian Luo, Kaitao Song, Jian Li, Tao Qin, and Tie-Yan Liu. 2021. NAS-BERT: Task-agnostic and adaptive-size BERT compression with neural architecture search. arXiv preprint arXiv:2105.14444.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Appendix

A Visualization of Searched Models

We visualize the architectures of our base model, the searched models of the first two search stages, and our EfficientBERT in Figure 6, (a) to (d). From the architectures, we can observe that our EfficientBERT is more efficient, since most of the searched intermediate expansion ratios are 1/2 while most of the searched stack numbers are less than 2. Besides, in our EfficientBERT, lower layers tend to have a larger FFN stack number or intermediate expansion ratio (e.g., layers 1 and 2), so as to enrich the semantic representation to the maximum extent for processing by higher layers. In comparison, higher layers tend to learn more complex mathematical formulas (e.g., layers 4 and 5) to enhance the nonlinearity of the lower, enriched representations. This could bring many inspirations for efficient and effective backbone design.
B Visualization of Model Nonlinearity

To further show the superior nonlinearity of our searched models, we visualize the attention maps of the twelve attention heads for BERTBASE, our EfficientBERT, TinyBERT6, and our EfficientBERT (TinyBERT6) in Figure 7. As can be seen, the feature maps of our EfficientBERT are close to those of BERTBASE. This verifies the nonlinear mapping ability of our EfficientBERT in fitting the teacher model. Moreover, the attention distributions of our EfficientBERT (TinyBERT6) are closer to BERTBASE than those of TinyBERT6 in most of the attention heads. This again demonstrates the excellent nonlinear representation ability of our EfficientBERT (TinyBERT6).

Then, we visualize the feature maps of the FFN outputs for the above four models, as shown in Figure 8. The observations in Figure 8 are similar to those in Figure 7, once again demonstrating the superior nonlinear representation ability of our EfficientBERT and EfficientBERT (TinyBERT6).
Figure 6: Architectures of (a) our base model, (b)-(c) the searched models of the first two search stages, and (d) our EfficientBERT.
Figure 7: Visualization of the attention distributions of (a) BERTBASE, (b) our EfficientBERT, (c) TinyBERT6, and (d) our EfficientBERT (TinyBERT6) in the last Transformer layer.

Figure 8: Visualization of the FFN output distributions of (a) BERTBASE, (b) our EfficientBERT, (c) TinyBERT6, and (d) our EfficientBERT (TinyBERT6) in the last Transformer layer.
