
Title: A Fast Retraining-Free Correlated-Fisher-Based Pruning Method for Transformers

Authors: Shobhit Bansal (Delhi Technological University) ([email protected])


Utkarsh Saxena (Delhi Technological University) ([email protected])
Yamini Srivastava (Delhi Technological University) ([email protected])

Abstract:
Transformers have revolutionized natural language processing, from classification to question-answering tasks. However, their practical adoption has so far been limited by their heavy architectures and high computational overhead. Pruning has proven effective at reducing the overhead of deep neural network architectures, but most existing pruning methodologies for Transformers are computationally expensive and complicate model deployment in many real-life situations. To address this, we propose a pruning methodology that is computationally inexpensive and requires no retraining. Given a FLOPs constraint and a sample dataset of about 2K examples, the method automatically prunes the architecture using structured sparsity. To retain high accuracy without retraining, our method follows three steps: (i) a simple, lightweight mask search algorithm that decides which nodes to prune based on gradient correlation and Fisher information; (ii) rearranging the mask values; and (iii) rescaling the mask to reconstruct the output activations. We tested our method on the BERTBASE and DistilBERT architectures using the SQuAD and GLUE NLP benchmarks. Our framework achieved up to a 1.75x reduction in FLOPs while keeping the accuracy drop below 2%. The highlight is that our framework prunes heavy transformer architectures in less than 5 minutes on a single GPU, where some previous methods took nearly 30 hours.

1. Introduction
Ever since attention was introduced in 2017 [1], Transformer architectures have become an industry standard for natural language processing tasks [2, 3], and they are now commonly used in computer vision [4, 5] and speech recognition [6, 7] as well. Although Transformers are widely accepted, their efficient deployment in pipelines has always been a challenge due to their heavy architectures and high computational overhead. Structured pruning has proven to address this challenge and is a very popular area of study.
Prior work on pruning Transformers has been successful in reducing the high inference times. However, these methods are often impractical for several reasons. First, retraining: almost every method requires retraining the model and learning the method's configuration during training time, as can be seen in [8, 9]. These methods significantly reduce inference time but increase the training time by nearly 10x, adding computational overhead to the model. Second, these methods make the model fragile by adding many variable hyperparameters, essentially adding a lot of moving parts to an already well-tuned machine. They require additional engineering effort for hyperparameter tuning, deployment, and debugging, which has hindered adoption in commercial pipelines. Third, these frameworks are not adaptable to practical needs: they are trained under rigid constraints that leave little to no room for flexibility, and therefore cannot serve as a general framework for a variety of user or deployment needs. This results in sub-optimal pruned models regardless of the user's hardware capabilities.
Many post-training methods have also been explored (see Section 2.3). Although training-time compression methods have generally been shown to deliver better model compression, post-training methods are preferred in both academia and industry because they are easier to implement and add negligible overhead.
Figure 1. Overview of our method against existing pruning methods
We propose a correlated transformer pruning framework with two major advantages. (i) Our method avoids expensive retraining on a large dataset, reducing pruning time from hours of computation to mere minutes; and (ii) it is flexible to practical needs. The framework takes a fine-tuned model, a FLOPs constraint value, and a sample dataset as inputs, and returns a pruned model that can be deployed in CI/CD pipelines almost immediately. The framework first searches for the optimal mask that decides which heads/nodes to prune, using correlation coefficients among network gradients and Fisher information (Section 3.3). Second, a greedy mask rearrangement algorithm is applied to restore the accuracy lost during mask search (Section 3.4), and finally a mask tuning step ensures that the output signal of each layer is recovered (Section 3.5). We performed extensive testing on the BERTBASE (see Table 1) and DistilBERT (see Table 2) architectures on the SQuAD and GLUE benchmarks, achieving a 25% to 40% reduction in FLOPs while staying within 2% of the original accuracy. The highlight of this method is its pruning time: at only 50 seconds for GLUE and 200 seconds for SQuAD tasks, it is almost 100x faster than retraining-based pruning methods.
2. Related Work
2.1. Efficient Transformers
These methods focus on reducing the inference time and memory footprint of Transformers through means other than pruning. They can be broadly categorized as (i) efficient architecture design [10, 11, 12, 13, 14], (ii) hardware-software co-design [15, 16, 17, 18], and (iii) knowledge distillation [19, 20, 21, 22], among others. However, these methods are not the focus of our work, so a detailed discussion of them is beyond the scope of this article.
2.2. Transformer Pruning
[23] discusses the state of sparsity in deep neural networks, highlighting the importance of pruning techniques. [24] proposes movement pruning, which adapts sparsity through fine-tuning. [25] introduces the Optimal BERT Surgeon, a scalable second-order pruning method for large language models. [26, 27] apply the lottery ticket hypothesis to BERT, suggesting efficient training of BERT models through sparse networks. [28] originally proposed the lottery ticket hypothesis, advocating for trainable sparse neural networks. [29] further explores the lottery ticket hypothesis in BERT models, demonstrating its effectiveness. While these methods have been effective at model compression, hardware accelerators such as GPUs and TPUs can hardly take advantage of unstructured sparse patterns for inference speedup.
For this reason, many structured pruning methods have been explored to remove coarse-grained sets of parameters from Transformers. [30] propose structured pruning for large language models, systematically
removing less critical neurons to compress models effectively. [31] introduces hardware-friendly block
structured pruning, optimizing transformers for deployment on resource-constrained hardware. [32]
present tile-wise sparsity, a method to accelerate sparse DNN models without hardware support by
optimizing sparse computation. [33, 34] explore attention head specialization, suggesting that certain
heads can carry most of the workload, allowing for the pruning of less essential heads. [35] introduce
structured dropout to dynamically reduce transformer depth during training, offering flexibility in model
compression. [36] investigate the effects of dropping layers in pre-trained transformer models,
analyzing trade-offs between model size reduction and performance retention.
Unfortunately, while structured methods have achieved considerable model compression and remarkable inference speedups, they are often difficult to use in practice, mainly because of the high computational cost of the additional retraining, which can be as high as 10x the original model training time [8, 9]. A further complication arises in deployment pipelines, where each pruning stage frequently requires rewriting training code and introduces additional hyperparameters to tune.
2.3. Post-Training Quantization
Post-training quantization methods have been widely studied in academia and are preferred in industry deployment pipelines because they avoid additional training overhead. [37] introduces
post-training 4-bit quantization for convolution networks, providing a method for rapid deployment by
reducing the bit-width of weights and activations. [38] propose layer-wise calibration and integer
programming to improve post-training neural quantization, enhancing the accuracy of quantized
models. [39] present adaptive rounding for post-training quantization, introducing a method that
dynamically adjusts rounding directions to minimize quantization error, thus improving model
accuracy. [40] explore improving neural network quantization without retraining through outlier
channel splitting, a technique that splits outlier channels into separate quantization groups to enhance
quantization accuracy.
Post-training schemes have also been explored for structured pruning as well as unstructured pruning
of CNNs. [41] proposed neuron merging as a compensatory mechanism for pruned neurons in CNNs,
aiming to mitigate the performance degradation caused by pruning. By merging the weights of pruned
neurons into surviving ones, they maintain model accuracy while achieving significant compression.
[42] introduce data-free parameter pruning, a method to prune parameters from deep neural networks
without relying on additional data or retraining. This technique reduces model size and computational
complexity while preserving performance, making it suitable for resource-constrained environments.
[43] presents RED (Looking for Redundancies), a method for data-free structured compression of deep
neural networks. By identifying and removing redundant structures within the network, such as
redundant filters or neurons, RED achieves compression without sacrificing accuracy, making it
applicable to various network architectures. However, it is extremely difficult to extend these methods to Transformers, because CNN pruning exploits the repeating structure and element-wise non-linearity of CNNs, which Transformers lack. Since existing post-training methods for CNNs cannot be extended to Transformers, we propose a general method. Our work focuses mainly on Transformers, but the approach we present should be general enough to apply to CNNs as well.

3. Methodology
3.1. Notations & Terminologies
This work focuses on encoder-based Transformer architectures, i.e., the BERT family [3]. These architectures are stacks of Transformer encoder blocks, each consisting of a multi-head attention module (the mathematical function is referred to as MHA hereafter) [1] followed by a simple feed-forward network (referred to as FFN hereafter). Within an encoder block, the MHA consists of H independent heads:
$\mathrm{MHA}(x) = \sum_{i=1}^{H} \mathrm{Att}_i(x), \qquad x_{\mathrm{MHA}} = \mathrm{LayerNorm}\big(x + \mathrm{MHA}(x)\big)$    (1)

$\mathrm{FFN}(x) = \sum_{i=1}^{N} W^{(2)}_{:,i}\,\sigma\big(W^{(1)}_{i,:}\,x + b^{(1)}_i\big) + b^{(2)}, \qquad x_{\mathrm{out}} = \mathrm{LayerNorm}\big(x_{\mathrm{MHA}} + \mathrm{FFN}(x_{\mathrm{MHA}})\big)$    (2)
where $\mathrm{Att}_i$ is the scaled dot-product attention function of head i and x is the embedded input sequence. For an FFN consisting of N filters, $W^{(1)}$, $W^{(2)}$, $b^{(1)}$ and $b^{(2)}$ are the FFN parameters, namely the weights and biases of the feed-forward network, and $\sigma$ is the activation function, which is typically GELU in the BERT family of architectures. The tuple (H, N) is typically (12, 3072) for BERTBASE and (16, 4096) for BERTLARGE.

Figure 2. Overview of the pruning method. (a) Initially, every head/filter is left unpruned. (b) Mask search, as in Section 3.3. (c) Rearrangement, as in Section 3.4. (d) Mask tuning, as in Section 3.5.
As introduced earlier, we pose our pruning method as a search for a sparse mask over both the heads and the filters of the architecture. With the addition of the mask, the architecture's equations become:

$\mathrm{MHA}\big(x;\, m^{l}_{\mathrm{MHA}}\big) = \sum_{i=1}^{H} m^{l,i}_{\mathrm{MHA}} \cdot \mathrm{Att}_i(x)$    (3)

$\mathrm{FFN}\big(x;\, m^{l}_{\mathrm{FFN}}\big) = \sum_{i=1}^{N} m^{l,i}_{\mathrm{FFN}} \cdot W^{(2)}_{:,i}\,\sigma\big(W^{(1)}_{i,:}\,x + b^{(1)}_i\big) + b^{(2)}$    (4)

To sum up, there are LH head mask variables and LN filter mask variables, for a total of L(H + N) mask variables. To simplify the calculations that follow, we write $m_{\mathrm{MHA}} \in \mathbb{R}^{LH}$, $m_{\mathrm{FFN}} \in \mathbb{R}^{LN}$ and $m \in \mathbb{R}^{L(H+N)}$ for the flattened vectors of head, filter and combined mask variables.
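For concreteness, the sketch below shows one way the head and filter masks of Eqs. 3-4 could be applied inside a single encoder layer in PyTorch. The tensor shapes, the per-head stacking of attention outputs, and the function names are assumptions made for this illustration, not the framework's actual implementation.

```python
import torch

def masked_mha_output(att_outputs, m_mha):
    """Eq. (3): sum of per-head attention outputs, each scaled by its mask value.

    att_outputs: tensor of shape (H, batch, seq_len, d_model), one slice per head Att_i(x).
    m_mha:       tensor of shape (H,), binary (or later real-valued) head mask.
    """
    return (m_mha.view(-1, 1, 1, 1) * att_outputs).sum(dim=0)

def masked_ffn(x, w1, b1, w2, b2, m_ffn, act=torch.nn.functional.gelu):
    """Eq. (4): FFN with a per-filter mask on the intermediate activations.

    x:  (batch, seq_len, d_model)
    w1: (N, d_model), b1: (N,)       -- first linear layer (the N filters)
    w2: (d_model, N), b2: (d_model,) -- second linear layer
    m_ffn: (N,) filter mask
    """
    hidden = act(x @ w1.t() + b1)   # (batch, seq_len, N)
    hidden = hidden * m_ffn         # zero out (or rescale) pruned filters
    return hidden @ w2.t() + b2
```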
Extent of Pruning. The scope of this method is to prune the attention heads and the feed-forward networks by constructing structured sparse masks. We do not consider pruning the initial input embeddings/positional encodings or the final classifier, as pruning them does not significantly speed up inference; the cost-benefit trade-off of pruning these layers does not justify the effort.
Inputs. The framework has three inputs: a Transformer model fine-tuned for a downstream NLP task; a sample dataset, randomly drawn as a subset of roughly 2K examples from the training set (which can be tuned for any specific practical use case); and a constraint value for the target FLOPs, specified as a fraction of the original model's FLOPs.
3.2. Problem Formulation.
We can formulate transformer pruning as a constrained optimization problem over the mask m:
$\arg\min_{m}\; L(m) \quad \text{such that} \quad \mathrm{Cost}(m) \le C$    (5)
where L denotes the loss function, Cost is the FLOPs of the architecture pruned by the mask, and C is the given constraint value. However, such an optimization problem is generally intractable, since the cost function is typically defined via the $\ell_0$ norm of the mask values, which is non-differentiable.
We can approximate the loss function using a second-order Taylor expansion around the initial mask $m_i$:
$L(m) \approx L(m_i) - g^{\top}(m_i - m) + \tfrac{1}{2}(m_i - m)^{\top} H (m_i - m)$    (6)
$\approx L(m_i) + \tfrac{1}{2}(m_i - m)^{\top} H (m_i - m)$    (7)
where
$g = \mathbb{E}\!\left[\tfrac{\partial}{\partial m} L(m_i)\right] \quad \text{and} \quad H = \mathbb{E}\!\left[\tfrac{\partial^2}{\partial m^2} L(m_i)\right]$

We further assume that the loss has converged to a local minimum, where the gradient term is close to zero [44]; $m_i$ refers to the initial mask of all ones. Since $L(m_i)$ is a constant, it can be dropped, and the objective can be rewritten as:
$\arg\min_{m} L(m) \approx \arg\min_{m}\; (m_i - m)^{\top} H\, (m_i - m)$    (8)
Since forming the exact Hessian matrix is infeasible, we approximate H with the empirical Fisher information of the mask variables (combined with the gradient correlation introduced in Section 3.3):
$I := \frac{1}{|D|} \sum_{(x,y) \in D} \left(\frac{\partial}{\partial m} L(x, y;\, m_i)\right) \left(\frac{\partial}{\partial m} L(x, y;\, m_i)\right)^{\top}$    (9)
where D is the sample dataset and (x, y) is an input-label pair from D.
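As an illustration of how the per-sample gradient terms in Eq. 9 could be collected, the following PyTorch sketch assumes a hypothetical model signature `model(inputs, mask)` that applies the multiplicative masks of Eqs. 3-4; it is a minimal sketch under that assumption, not the framework's code.

```python
import torch

def mask_gradients_per_sample(model, mask, sample_iter, loss_fn):
    """Per-sample gradients of the loss w.r.t. the mask variables (the terms in Eq. 9).

    model:       callable taking (inputs, mask); applying the masks of Eqs. (3)-(4)
                 inside the model is an assumption of this sketch.
    mask:        leaf torch tensor of ones (the initial mask m_i) with requires_grad=True.
    sample_iter: iterable yielding one (inputs, labels) example at a time.
    Returns a (num_samples, num_mask_vars) tensor of gradients evaluated at m_i.
    """
    grads = []
    for inputs, labels in sample_iter:
        mask.grad = None                       # reset gradient for this example
        loss = loss_fn(model(inputs, mask), labels)
        loss.backward()                        # populates mask.grad
        grads.append(mask.grad.detach().clone())
    return torch.stack(grads)
```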

3.3. Correlated-Fisher-Based Mask Search


Importance Scores. It is intractable to solve the optimization problem using the full Fisher information matrix I, so we assume that I is diagonal. This further simplifies Eq. 8 to:
$\arg\min_{m} L(m) \approx \arg\min_{m} \sum_{j} (1 - m_j)^2\, I_{jj}$    (10)

For the first two stages of our pruning pipeline, we restrict the mask variables to be binary, i.e., 0 or 1, where 0 indicates a completely pruned node and 1 an unpruned node. Under this assumption, the following can be derived from Eq. 10:
$\arg\min_{m} L(m) \approx \arg\min_{m} \sum_{j \in z(m)} I_{jj}, \quad \text{where } z(m) := \{\, j \mid m_j = 0 \,\}$    (11)
We can interpret each diagonal element of I as the importance score of the head or filter associated with that mask variable. Such importance scores have also been introduced in [45, 46]. We improve upon them as follows: when computing the diagonal approximation of the Fisher information matrix as explained in [45, 46], we also capture the correlation between heads/filters. The new importance score $I_{\mathrm{Corr}}$ is computed via Eq. 12:
$I_{\mathrm{Corr}} = I \times \mathrm{corr}(\mathrm{grads})$    (12)
where corr(grads) produces the correlation coefficients between the mask-variable gradients over the sample dataset.
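Eq. 12 leaves the exact combination of the diagonal Fisher scores and the gradient correlations unspecified. The NumPy sketch below shows one plausible reading, in which each Fisher score is weighted by the mean absolute correlation of its gradient with the other mask gradients; both this aggregation and the use of a single dense correlation matrix are assumptions made for illustration only.

```python
import numpy as np

def correlated_fisher_importance(grads):
    """Sketch of Eqs. (9)-(12) under the diagonal-Fisher assumption.

    grads: array of shape (num_samples, num_mask_vars) holding the per-sample
           gradients of the loss w.r.t. the mask variables, evaluated at m_i = 1.
    Returns a vector of correlation-weighted importance scores, one per mask variable.
    """
    # Diagonal of the empirical Fisher information (Eq. 9): mean squared gradient.
    fisher_diag = (grads ** 2).mean(axis=0)

    # Correlation coefficients between the gradients of different mask variables
    # over the sample dataset (the corr(grads) term in Eq. 12).
    corr = np.corrcoef(grads, rowvar=False)   # (num_mask_vars, num_mask_vars)

    # One possible aggregation (an assumption): weight each Fisher score by the
    # mean absolute correlation of its gradient with all mask gradients.
    corr_weight = np.abs(corr).mean(axis=1)
    return fisher_diag * corr_weight
```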
Solving the FLOPs-constrained Problem. Given a target FLOPs cost, denoted by C, we can formulate the binary mask search problem as:
$\arg\min_{m} \sum_{j \in z(m)} I_{jj} \quad \text{such that} \quad F_{\mathrm{head}}\, \|m_{\mathrm{MHA}}\|_0 + F_{\mathrm{filter}}\, \|m_{\mathrm{FFN}}\|_0 \le C$    (13)
where $F_{\mathrm{head}} \in \mathbb{R}$ and $F_{\mathrm{filter}} \in \mathbb{R}$ are the FLOPs required to compute one head and one filter, respectively. FLOPs (floating-point operations) are a suitable pruning constraint because they are constant irrespective of the hardware and computation setting a use case might demand. The optimization problem in Eq. 13 can be solved with a knapsack algorithm as described in [47]; moreover, two observations made in [48] allow a faster polynomial-time solution: (i) keeping more heads and filters unpruned always improves the objective of Eq. 13, since the diagonal elements of I are non-negative; and (ii) if a certain number of heads must be pruned, they should be the ones with the lowest importance scores, because every head accounts for the same amount of FLOPs. This leads to the mask search procedure described in Algorithm 1.
Algorithm 1: Mask Search
Input: FLOPs constraint C; diagonal correlated-Fisher importance scores I_Corr
1. for n = 0 to LH do                          # candidate number of remaining heads
2.     k1 = LH − n                             # heads to prune
3.     HI = indices of the k1 least important heads
4.     f = floor((C − n·F_head) / F_filter)    # remaining filters that fit the budget
5.     k2 = LN − f                             # filters to prune
6.     FI = indices of the k2 least important filters
7.     S[n] = Σ_{i ∈ HI ∪ FI} I_Corr[i]        # total importance pruned away
8.     R[n] = (HI, FI)
9. end for
10. n* = arg min_n S[n]                        # optimal number of remaining heads
11. HI*, FI* = R[n*]                           # indices of heads/filters to prune
12. initialize m_MHA and m_FFN to all ones
13. m_MHA[HI*] = 0                             # prune the selected heads
14. m_FFN[FI*] = 0                             # prune the selected filters
Output: m* = (m_MHA, m_FFN)
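A minimal NumPy sketch of Algorithm 1 follows, assuming the correlated-Fisher scores have already been split into per-head and per-filter vectors and that the per-head and per-filter FLOPs costs are known constants; the variable names are chosen for readability and are not taken from the paper's implementation.

```python
import numpy as np

def mask_search(head_importance, filter_importance, flops_constraint,
                flops_per_head, flops_per_filter):
    """Sketch of Algorithm 1: FLOPs-constrained binary mask search.

    head_importance:   (LH,) correlated-Fisher scores for all heads across layers
    filter_importance: (LN,) correlated-Fisher scores for all FFN filters
    flops_constraint:  target FLOPs budget C (same units as the per-unit costs)
    """
    LH, LN = len(head_importance), len(filter_importance)
    heads_sorted = np.argsort(head_importance)       # ascending: least important first
    filters_sorted = np.argsort(filter_importance)

    best_cost, best_masks = np.inf, None
    for n_heads in range(LH + 1):                    # number of heads to keep
        # Filters that still fit the budget after keeping n_heads heads.
        n_filters = int((flops_constraint - n_heads * flops_per_head) // flops_per_filter)
        if n_filters < 0:
            continue
        n_filters = min(n_filters, LN)

        pruned_heads = heads_sorted[:LH - n_heads]       # least important heads
        pruned_filters = filters_sorted[:LN - n_filters]
        cost = head_importance[pruned_heads].sum() + filter_importance[pruned_filters].sum()

        if cost < best_cost:
            m_mha = np.ones(LH); m_mha[pruned_heads] = 0.0
            m_ffn = np.ones(LN); m_ffn[pruned_filters] = 0.0
            best_cost, best_masks = cost, (m_mha, m_ffn)
    return best_masks
```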

3.4. Correlated-Fisher-Based Mask Rearrangement


Block Diagonal Approximation. In the first stage, we assumed that the Fisher information matrix is diagonal. While this approximation provides a solid footing for initializing the mask search, it is the main cause of the accuracy drop in the pruned models. To recover this loss, we reconsider the pruning problem using a block-diagonal matrix instead of a diagonal one, as visualized in Figure 3.

Figure 3. Block Diagonal Approximation of Fisher Matrix


Algorithm 2: Mask Rearrangement
Input: mask found by Mask Search; gradients over the sample dataset
1. for l = 1 to L do                       # iterate over layers
2.     sort the layer's importance scores, keeping their indices
3.     H1 = indices of unpruned heads/filters in layer l
4.     for k in H1 do
5.         G = I[k]                        # importance at index k
6.         C = total importance − G        # complement of G
7.         remove the index of C
8.     end for
9.     recreate the mask for layer l
10. end for
Output: rearranged mask m̂ = (m_MHA, m_FFN)
Assuming that the number of heads/filters to be pruned in each layer has already been determined by the mask search, Eq. 8 decomposes into a set of layer-wise optimization problems, each of which we solve with the greedy procedure in Algorithm 2:
$\hat{m}^{\,l} = \arg\min_{m^l}\; (m_i - m^l)^{\top} I^l\, (m_i - m^l)$    (14)
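Algorithm 2 can be read as a layer-wise greedy minimization of the quadratic form in Eq. 14. The sketch below spells out one such greedy rule, in which the variable whose removal adds the least to the quadratic objective is pruned at each step; this is an illustrative interpretation under that assumption, not necessarily the authors' exact procedure.

```python
import numpy as np

def rearrange_layer_mask(fisher_block, num_to_prune):
    """Greedy layer-wise rearrangement sketch for Eq. (14).

    fisher_block: (d, d) block of the Fisher matrix for one layer's mask variables
                  (heads or filters), including off-diagonal correlations.
    num_to_prune: number of variables in this layer to set to zero, as decided
                  by the preceding mask search stage.
    Returns a binary mask of shape (d,).
    """
    d = fisher_block.shape[0]
    pruned = []
    for _ in range(num_to_prune):
        best_j, best_delta = None, np.inf
        for j in range(d):
            if j in pruned:
                continue
            # Increase in (m_i - m)^T I^l (m_i - m) caused by additionally pruning j:
            # its diagonal term plus twice the interaction with already pruned indices.
            delta = fisher_block[j, j] + 2.0 * sum(fisher_block[j, k] for k in pruned)
            if delta < best_delta:
                best_j, best_delta = j, delta
        pruned.append(best_j)
    mask = np.ones(d)
    mask[pruned] = 0.0
    return mask
```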

3.5. Mask Tuning


In the third stage, we relax our assumptions further and drop the restriction that the mask values are binary. The nonzero variables of the mask m̂ from Section 3.4 are tuned to real values such that the pruned model recovers its accuracy. The mask variables are tuned to minimize the layer-wise reconstruction error, as suggested by [49]. This can be expressed as:
$\arg\min_{m^l}\; \big\| x + \mathrm{layer}(x;\, m^l) - \big(x' + \mathrm{layer}(x';\, m_i)\big) \big\|_2^2$    (15)

where layer is either MHA or FFN, and x and x' are the input sequences of the pruned and the original model, respectively. Here we compare the activations after the residual connection. Note that this stage does not change the model's FLOPs, as we only tune the non-zero mask variables. As shown in [48], Eq. 15 can be reduced to a least-squares problem of the form:
$\arg\min_{m^l}\; \| A\, m^l - b \|_2^2$    (16)

where the matrix A contains the head/filter output activations of the model pruned by the binary mask, and the vector b is the difference between the output activations of the two models.
The framework uses the LSMR solver of the CuPy package with a regularization (damping) hyperparameter. Concretely, we re-parameterize the least-squares problem as follows, with $m^l = m_i + r^l$, and solve it with the damping hyperparameter fixed at 1:
$\arg\min_{r^l}\; \| A\, r^l + A\, m_i - b \|_2^2$    (17)

To mitigate the risk of adversely impacting accuracy during mask tuning, we implement a restriction
on the acceptable range of the tuned mask variable, limiting it to the interval [-10, 10]. If the mask for
a layer surpasses this range, indicating potentially detrimental adjustments, we discard the mask for that
layer and halt further tuning. This heuristic strategy ensures stability across various models, tasks, and
random seeds in our experiments. Notably, while these heuristics involve two hyperparameters, namely,
damp and the acceptable range, we find through empirical observation that these values remain
consistent across different tasks and models. Therefore, we adopt fixed values for these hyperparameters
throughout all our experiments, providing a robust and reliable approach to mask tuning without the
need for task-specific or model-specific adjustments.
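A simplified sketch of this tuning step for a single layer is shown below. It uses SciPy's LSMR solver with a damping term in place of the CuPy solver mentioned above, and it glosses over how the activation matrix A and the target vector b are assembled; those details, the helper's name, and its signature are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse.linalg import lsmr

def tune_layer_mask(A, b, binary_mask, damp=1.0, max_abs=10.0):
    """Sketch of the mask-tuning stage (Eqs. 16-17) for one MHA or FFN layer.

    A:           (num_rows, k) output activations of the surviving heads/filters of the
                 binary-masked model, one column per unpruned mask variable, in index order.
    b:           (num_rows,) difference between the original and binary-masked layer outputs.
    binary_mask: (d,) the {0, 1} mask from the rearrangement stage (k ones in total).
    """
    m_i = np.ones(A.shape[1])
    # Re-parameterize m^l = m_i + r^l and solve the damped least-squares problem
    # argmin || A r + A m_i - b ||^2 (Eq. 17); damping regularizes r only.
    r = lsmr(A, b - A @ m_i, damp=damp)[0]
    tuned = m_i + r

    # Heuristic safeguard from the text: if any tuned value leaves [-max_abs, max_abs],
    # discard the tuned mask for this layer and keep the binary one.
    if np.any(np.abs(tuned) > max_abs):
        return binary_mask
    full = binary_mask.copy()
    full[binary_mask == 1] = tuned   # write tuned values back onto surviving entries
    return full
```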
4. Evaluation
4.1. Experimental Setup
Our framework is implemented on top of PyTorch [50] and the HuggingFace Transformers [51] library. We evaluate the effectiveness of our approach using BERTBASE [3] and DistilBERT [20] on the GLUE [52] and SQuAD [53, 54] benchmarks. We use 2K examples from the training sets for pruning and evaluate the resulting models on the development sets. All results are averaged over runs with 10 different random seeds.
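As an illustration of this setup, the snippet below sketches how a fine-tuned checkpoint and a 2K-example pruning subset could be prepared with HuggingFace Transformers and Datasets for the MNLI task; the checkpoint path is hypothetical.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative setup only; the checkpoint path below is hypothetical and stands in
# for a model already fine-tuned on the downstream task (here, MNLI).
checkpoint = "path/to/finetuned-bert-base-mnli"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Draw ~2K examples from the GLUE MNLI training set to drive pruning, and keep
# the matched validation split for evaluating the pruned model.
mnli = load_dataset("glue", "mnli")
sample_dataset = mnli["train"].shuffle(seed=0).select(range(2000))
eval_dataset = mnli["validation_matched"]
```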
4.2. Performance Evaluation
Tables 1 and 2 report the accuracy of BERTBASE and DistilBERT under a 75% FLOPs constraint on the GLUE and SQuAD datasets, with only a ~2-3% drop in accuracy on most tasks. BERTBASE can be reduced to roughly 60-70% of its original FLOPs on almost all tasks, and DistilBERT shows a similar pattern.
Task             MNLI     QQP      QNLI     SST-2    STS-B    MRPC     SQuAD
Baseline         84.53    91       91.41    93.57    88.9     86.27    88.48
Mask Search      56.498   80.438   82.595   77.788   82.344   59.852   64.347
+ Rearrangement  79.208   88.978   88.364   89.965   87.432   83.161   80.782
+ Tuning         78.308   88.516   88.558   88.703   86.919   81.25    82.158
Table 1. Accuracy on GLUE and SQuAD tasks for the BERTBASE architecture with a 75% FLOPs constraint.
Task             MNLI     QQP      QNLI     SST-2    STS-B    MRPC     SQuAD
Baseline         82.11    89.99    88.56    91.4     86.12    84.8     85.73
Mask Search      44.408   68.207   55       57.614   66.753   63.453   35.98
+ Rearrangement  76.386   86.503   83.614   87.913   84.092   79.637   74.833
+ Tuning         72.723   85.273   82.505   86.312   83.812   80.93    74.903
Table 2. Accuracy on GLUE and SQuAD tasks for the DistilBERT architecture with a 75% FLOPs constraint.
4.3. Comparison with Prior Methods
We compare our method to the prior work DynaBERT [55], EBERT [56], BMP [8], and CoFi [9]. These methods admittedly achieve better model compression than our framework, but retraining time is where our framework takes the edge. As shown in Table 3, these methods require 5-33 hours of retraining, whereas our method finishes in less than a minute, which is 2-3 orders of magnitude faster. We also note that this training-latency analysis accounts for only a single hyperparameter configuration; the full cost should be multiplied by the size of the hyperparameter search space. While the prior methods rely on a considerable number of hyperparameters, ours introduces only two (see Section 3.5), which we fix for all experiments.
Method          No. of Epochs   E2E time (hr)
DynaBERT [55]   4               12
EBERT [56]      6               5
BMP [8]         20              33
CoFi [9]        40              17
Ours            0               0.01
Table 3. Pruning cost of prior structured pruning methods versus ours, comparing the number of training epochs and the end-to-end (E2E) time required for pruning.
5. Conclusion & Future Work
In this work we propose a pruning framework that requires no retraining and no specialized hardware accelerators; compared to other methods, the retraining time it requires is essentially zero. Our framework is a three-stage process: a fast correlated-Fisher-based mask search; a rearrangement of the mask variables; and finally a tuning of the mask values to recover the lost output signal. We empirically evaluate our framework on BERTBASE and DistilBERT, where our pruning method achieves up to 70% FLOPs reduction with only up to 4% accuracy degradation on the GLUE and SQuAD datasets. More importantly, our end-to-end pruning pipeline needs only 39 and 200 seconds for GLUE and SQuAD respectively, which is 2-3 orders of magnitude faster than prior methods.
Furthermore, the sample dataset used for the mask search offers room for improvement: it is currently drawn at random, with no principled statistical sampling technique, and a tailored sampling strategy may differ from one practical use case to another. A latency constraint could also be used in place of FLOPs; however, latency is specific to the target hardware accelerator and therefore gives a less general picture of framework performance.
6. References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pages 5998–6008, 2017.
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-
shot learners. arXiv preprint arXiv:2005.14165, 2020.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An
image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020
[5] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.
Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint
arXiv:2103.14030, 2021.
[6] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A
framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477,
2020.
[7] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li,
Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training
for full stack speech processing. arXiv preprint arXiv:2110.13900, 2021.
[8] François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster
transformers. arXiv preprint arXiv:2109.04838, 2021
[9] Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate
models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1513–1528, 2022.
[10] Forrest N Iandola, Albert E Shaw, Ravi Krishna, and Kurt W Keutzer. Squeezebert: What can
computer vision teach nlp about efficient neural networks? arXiv preprint arXiv:2006.11316, 2020
[11] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In
International Conference on Learning Representations, 2019.
[12] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint
arXiv:1909.11942, 2019.
[13] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint
arXiv:2004.02984, 2020.
[14] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention
with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[15] Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-
Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. A^3: Accelerating attention mechanisms in
neural networks with approximation. In 2020 IEEE International Symposium on High Performance
Computer Architecture (HPCA), pages 328–341. IEEE, 2020.
[16] Tae Jun Ham, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, and Jae W
Lee. Elsa: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural
networks. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture
(ISCA), pages 692–705. IEEE, 2021.
[17] Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato,
Victor Sanh, Paul Whatmough, Alexander M Rush, David Brooks, et al. Edgebert: Sentence-level
energy optimizations for latency-aware multi-task nlp inference. In MICRO-54: 54th Annual
IEEE/ACM International Symposium on Microarchitecture, pages 830–844, 2021
[18] Hanrui Wang, Zhekai Zhang, and Song Han. Spatten: Efficient sparse attention architecture with
cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance
Computer Architecture (HPCA), pages 97–110. IEEE, 2021
[19] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun
Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351,
2019.
[20] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version
of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[21] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert model
compression. arXiv preprint arXiv:1908.09355, 2019.
[22] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-
attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint
arXiv:2002.10957, 2020.
[23] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv
preprint arXiv:1902.09574, 2019.
[24] Victor Sanh, Thomas Wolf, and Alexander M Rush. Movement pruning: Adaptive sparsity by
fine-tuning. arXiv preprint arXiv:2005.07683, 2020.
[25] Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran,
Michael Goin, and Dan Alistarh. The optimal bert surgeon: Scalable and accurate second-order
pruning for large language models. arXiv preprint arXiv:2203.07259, 2022.
[26] Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and
Michael Carbin. The lottery ticket hypothesis for pre-trained BERT networks. arXiv preprint
arXiv:2007.12223, 2020.
[27] Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, and Jingjing Liu.
Earlybert: Efficient bert training via early-bird lottery tickets. arXiv preprint arXiv:2101.00063, 2020.
[28] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable
neural networks. arXiv preprint arXiv:1803.03635, 2018.
[29] Sai Prasanna, Anna Rogers, and Anna Rumshisky. When BERT plays the lottery, all tickets are
winning. arXiv preprint arXiv:2005.00561, 2020.
[30] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models.
arXiv preprint arXiv:1910.04732, 2019.
[31] Bingbing Li, Zhenglun Kong, Tianyun Zhang, Ji Li, Zhengang Li, Hang Liu, and Caiwen Ding.
Efficient transformer-based large scale language representations using hardware-friendly block
structured pruning. arXiv preprint arXiv:2009.08065, 2020.
[32] Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia,
Xipeng Li, Minyi Guo, and Yuhao Zhu. Accelerating sparse dnn models without hardware-support via
tile-wise sparsity. In SC20: International Conference for High Performance Computing, Networking,
Storage and Analysis, pages 1–15. IEEE, 2020.
[33] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? arXiv
preprint arXiv:1905.10650, 2019.
[34] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head
self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint
arXiv:1905.09418, 2019.
[35] Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with
structured dropout. arXiv preprint arXiv:1909.11556, 2019.
[36] Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. On the effect of dropping layers
of pre-trained transformer models. arXiv preprint arXiv:2004.03844, 2020.
[37] Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. Post-training 4-bit quantization of
convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723, 2018.
[38] Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Improving post
training neural quantization: Layer-wise calibration and integer programming. arXiv preprint
arXiv:2006.10518, 2020
[39] Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort.
Up or down? adaptive rounding for post-training quantization. In International Conference on
Machine Learning, pages 7197–7206. PMLR, 2020.
[40] Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Christopher De Sa, and Zhiru Zhang. Improving neural
network quantization without retraining using outlier channel splitting. Proceedings of Machine
Learning Research, 2019.
[41] Woojeong Kim, Suhyun Kim, Mincheol Park, and Geonseok Jeon. Neuron merging:
Compensating for pruned neurons. arXiv preprint arXiv:2010.13160, 2020.
[42] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks.
arXiv preprint arXiv:1507.06149, 2015
[43] Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. Red: Looking for
redundancies for data-free structured compression of deep neural networks. Advances in Neural
Information Processing Systems, 34:20863–20873, 2021.
[44] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural
information processing systems, pages 598–605, 1990.
[45] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance
estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 11264–11272, 2019
[46] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with
dense networks and fisher pruning. arXiv preprint arXiv:1801.05787, 2018.
[47] Yonathan Aflalo, Asaf Noy, Ming Lin, Itamar Friedman, and Lihi Zelnik. Knapsack pruning with
inner distillation. arXiv preprint arXiv:2002.08258, 2020
[48] Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir
Gholami. A fast post-training pruning framework for transformers. arXiv preprint
arXiv:2204.09656, 2022.
[49] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural
networks. In Proceedings of the IEEE international conference on computer vision, pages 1389–1397,
2017.
[50] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style,
high-performance deep learning library. Advances in neural information processing systems,
32:8026–8037, 2019.
[51] Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi,
Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. Transformers: State-of-the-art
natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations, pages 38–45, 2020.
[52] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman.
GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv
preprint arXiv:1804.07461, 2018.
[53] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable
questions for squad. arXiv preprint arXiv:1806.03822, 2018.
[54] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+
questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[55] Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic
bert with adaptive width and depth. arXiv preprint arXiv:2004.04037, 2020.
[56] Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. Ebert: Efficient bert inference with dynamic
structured pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021,
pages 4814–4823, 2021.
