A Knowledge-Distillation-Integrated Pruning Method for Vision Transformer
Bangguo Xu1 , Tiankui Zhang1 , Yapeng Wang2 , Zeren Chen3
1 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications,
Abstract—Vision transformers (ViTs) have made remarkable achievements in various computer vision applications such as image classification, object detection, and image segmentation. Since the self-attention mechanism can model the relationship between all pixels of the input image, the performance of ViT models is significantly improved compared with traditional CNN networks. However, their storage, runtime memory, and computing requirements hinder their deployment on edge devices. This paper proposes a ViT pruning method with knowledge distillation, which can prune the ViT model while avoiding performance loss after pruning. Based on the idea that knowledge distillation allows a student model to improve its performance by learning the unique knowledge of a teacher model, a convolutional neural network (CNN), which has the unique abilities of parameter sharing and local receptive fields, is used as the teacher model to guide the training of the ViT model and enable the ViT model to obtain the same abilities. In addition, some important parts may be cut during pruning, resulting in irreversible loss of model performance. To solve this problem, this paper designs an importance score learning module to guide the pruning work and ensure that pruning removes only the unimportant parts of the model. Finally, this paper compares the pruned model with other methods in terms of accuracy, Floating Point Operations (FLOPs), and model parameters on ImageNet-1K.
Index Terms—knowledge distillation, network pruning, transformer pruning, vision transformer
I. INTRODUCTION
The development of computer vision technology is inseparable from the promotion of convolutional neural networks. At present, the achievements of the CNN model in image recognition are close to a bottleneck, and with the development of the CNN model there has been a considerable accumulation of lightweight work for it. Neural network pruning methods have received more and more attention due to their widespread presence in compression tasks. By pruning a large number of unimportant parameters, or directly removing a large number of channels in each layer of the model structure, the size of the model can be reduced without loss of performance. The trimmed model not only saves a lot of storage space due to the reduction of parameters, but also reduces the computational complexity of the model, saves computing resources, and speeds up the inference speed of the model.

In recent years, with the introduction of the Transformer into the field of computer vision, its powerful global feature extraction ability has led it to surpass traditional CNN models in terms of accuracy. Compression work for the Vision Transformer (ViT) model has only just started. Due to the complex structure introduced by the self-attention mechanism and the low redundancy of the model parameters, how to tailor the ViT model without losing performance has become a challenge. Since the ViT model does not have a convolutional structure, channels cannot be directly deleted to achieve pruning, and using the pruning methods previously applied to CNN models will greatly damage the performance of the model.

Fig. 1. Parameter comparison of ResNet50 and ViT-B/16 models.

The heatmap shown in Fig. 1 illustrates the difference in model parameters between the regular CNN model ResNet50 and the regular ViT-B/16 model. The heatmap consists of 30×30 parameters randomly selected from the weight matrix; the lighter the color, the closer the parameter is to 0. It can be seen that the CNN model contains a large number of parameters close to 0, so it can be tailored without losing model performance. Most of the parameters of the ViT model are not close to 0, indicating that most of the parameters in the ViT model are very important. If the pruning methods designed for CNN models are used to delete them, serious loss of model performance will result, which greatly increases the difficulty of pruning. Therefore, how to prune
This work is supported by Key Technology Research Project of Jiangxi Province (20213AAE01007).
A. Pruning Location Analysis

First, analyze the parameters of the ViT model. In order to obtain the output features in (1), the tokens input to the model first go through the three matrices Q, K and V. The Floating Point Operations (FLOPs) of this calculation are 3 × n × d × (d + (d − 1)). Considering that the fully connected layer has a bias calculation at the end, the FLOPs for this part are 3 × n × d × (2d). Then the calculation of (2) is performed on Q, K and V, and the FLOPs of this part are 4n²d.

After that, as shown in (4) above, the self-attention matrix of dimension (n, d) is multiplied with the fully connected layer matrix of dimension (d, d), whose FLOPs are n × d × (2d), and then the obtained matrix is added to the input tokens, with FLOPs of n × d. In conclusion, the FLOPs of the ViT model in the MHSA part are 8nd² + 4n²d.

After obtaining the output Y of the MHSA part, Y is input to the MLP module, as shown in (5). Considering that the hidden dimension of the fully connected layers is generally 4 times the input dimension, the hidden layer dimension is 4d. It can be obtained by calculation that the FLOPs of this part are 16nd².

As for the normalization layer and the activation function layer, the input is only processed once and their FLOPs are n × d, which can be ignored compared with MHSA and MLP. Therefore, the pruning work of the model is mainly aimed at the MHSA and MLP parts.
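As a quick sanity check on the counting above, the per-block estimates (8nd² + 4n²d for MHSA and 16nd² for the MLP) can be reproduced with a short script. The values n = 197 and d = 768 below are the usual ViT-B/16 token count and embedding width and are only illustrative; they are not taken from the paper.

```python
# Rough per-block FLOPs estimate for one ViT encoder layer, following the
# counting convention used above (hedged sketch; constants are illustrative).
def vit_block_flops(n: int, d: int) -> dict:
    mhsa = 8 * n * d ** 2 + 4 * n ** 2 * d   # Q/K/V and output projections plus attention products
    mlp = 16 * n * d ** 2                    # two fully connected layers with a 4d hidden width
    other = n * d                            # LayerNorm / activation, negligible in comparison
    return {"MHSA": mhsa, "MLP": mlp, "other": other}

if __name__ == "__main__":
    flops = vit_block_flops(n=197, d=768)    # 196 patch tokens + 1 class token (assumed)
    total = sum(flops.values())
    for name, value in flops.items():
        print(f"{name:>5}: {value / 1e9:6.2f} GFLOPs ({100 * value / total:.1f}% of the block)")
```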
B. Distillation Token

The traditional ViT model takes a given image X ∈ R^(H×W×C), where H represents the height of the image, W the width, and C the number of channels, and converts it into N patches. After that, a fully connected layer is used to convert each patch of size 16 × 16 × 3 into a patch token. In addition to the N patch tokens, the ViT model also adds a class token that interacts with the ground truth to predict the classification result. On this basis, this method adds a distillation token that interacts with the output of the teacher model to realize the learning of teacher model knowledge, while also interacting normally with the other tokens to achieve global feature extraction:

T_emb = [T_cls; T_patch; T_dist]    (6)

where T_emb is the input of the ViT model with the distillation module, which extracts global feature information through the encoding layers of the ViT model, T_patch are the tokens converted from the input image, T_cls is the classification token used for prediction in the traditional ViT model, and T_dist is the distillation token introduced by this method.

After obtaining the model input T_emb, self-attention is computed on it. Specifically, T_emb is projected into the query matrix Q, the key matrix K and the value matrix V through fully connected layers:

Q = FC_Q(T_emb), K = FC_K(T_emb), V = FC_V(T_emb)    (7)

After getting the query matrix Q, the key matrix K and the value matrix V, the multi-head attention is calculated:

Attention_i(Q, K, V) = Softmax(QK^T / √d) V    (8)

Attention_total(Q, K, V) = [Attention_1(Q, K, V); ...; Attention_h(Q, K, V)]    (9)

where Attention_i(.) represents the i-th self-attention module in the multi-head attention, Attention_total(.) represents the concatenation of the outputs of the h self-attention modules, and h is the number of heads in the multi-head attention module.

In order to alleviate performance degradation and vanishing gradients when training multi-layer models, the output of the multi-head attention module is passed through a residual structure and a normalization layer:

Attention_output(Q, K, V) = LN(Attention_total(Q, K, V) + T_emb)    (10)

where LN denotes the normalization layer and Attention_output(.) denotes the final output of the multi-head attention module, which is also the input of the subsequent MLP module. The MLP module is composed of two fully connected layers, and its output is represented by Z:

Z = LN(Attention_output + FC_2(FC_1(Attention_output)))    (11)

For a ViT model with L layers, the output feature vector corresponding to the class token of the last layer is passed through the classifier to obtain the predicted distribution:

P_predict = Softmax(FC(Z_predict))    (12)

After getting the predicted distribution of the classifier, it interacts with the true label of the sample, and the cross-entropy loss function is used to obtain the loss:

L_base = − Σ_{i∈N} Σ_{c∈C} 1[y_i = c] · log(P_predict(y_i = c))    (13)

At the same time, the output feature vector corresponding to the distillation token is extracted, and a separate classifier is set for it to obtain its predicted distribution:

P_dist = Softmax(FC(Z_dist))    (14)

The prediction distribution obtained by the above formula interacts with the prediction results of the teacher model, and the cross-entropy loss function is likewise used to construct the loss:

L_dist = − Σ_{i∈N} Σ_{c∈C} 1[y_i = c] · log(P_dist(y_i = c))    (15)

where c here is the label output by the teacher model after the softmax function; its type is the same as the ground truth but its value may be slightly different.
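The token layout and the two prediction heads of (6)-(14) can be sketched as a single encoder block in PyTorch. This is a minimal illustration under our own naming, not the authors' implementation; positional embeddings, the patch-embedding layer, and the stacking of multiple layers are omitted.

```python
import torch
import torch.nn as nn

class DistilledViTBlockSketch(nn.Module):
    """Minimal sketch of Eqs. (6)-(14): class + patch + distillation tokens,
    one MHSA/MLP encoder block, and two separate classifier heads."""
    def __init__(self, d=768, heads=12, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, d))
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)    # Eqs. (7)-(9)
        self.norm1 = nn.LayerNorm(d)                                     # Eq. (10)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)                                     # Eq. (11)
        self.head_cls = nn.Linear(d, num_classes)                        # Eq. (12)
        self.head_dist = nn.Linear(d, num_classes)                       # Eq. (14)

    def forward(self, patch_tokens):                   # patch_tokens: (B, N, d)
        b = patch_tokens.size(0)
        t_emb = torch.cat([self.cls_token.expand(b, -1, -1),
                           patch_tokens,
                           self.dist_token.expand(b, -1, -1)], dim=1)    # Eq. (6)
        attn_out, _ = self.attn(t_emb, t_emb, t_emb)
        x = self.norm1(attn_out + t_emb)                                 # Eq. (10)
        z = self.norm2(x + self.mlp(x))                                  # Eq. (11)
        logits_cls = self.head_cls(z[:, 0])             # class-token prediction
        logits_dist = self.head_dist(z[:, -1])          # distillation-token prediction
        return logits_cls, logits_dist
```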
Finally, α is used as a hyperparameter to balance the distillation loss and the conventional loss, and the loss function is defined as:

L_global = α L_dist + (1 − α) L_base    (16)

In experiments, we found that the final accuracy of the model obtained by the proposed method as the student model can not only surpass that before distillation, but can even surpass the teacher model. This shows that the student model can learn the unique inductive bias of the CNN model by distilling it, that is, the parameter sharing and local receptive field unique to CNNs, thereby improving the student model's ability to deal with image problems. In the proposed method, if only the output of the classifier of the teacher model is learned, overfitting easily occurs. To solve this problem, we introduce soft distillation into the proposed method and control the influence of the teacher model on the student model by adjusting the distillation temperature T. The formula of soft distillation is as follows:

L_soft = T² KL(Softmax(Z_predict / T), Softmax(Z_teacher / T))    (17)

where KL(.) represents the KL divergence. The proposed method sets the hyperparameter β to control the influence of soft distillation on the overall distillation loss of the model, so the loss function of the final model is expressed as:

L_global = α L_dist + (1 − α) L_base + β L_soft    (18)
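A compact sketch of the overall objective in (13) and (15)-(18) is given below; the values of α, β and T are placeholders rather than the settings used in the paper, and the teacher's hard label is taken here as its argmax prediction.

```python
import torch
import torch.nn.functional as F

def global_distillation_loss(logits_cls, logits_dist, teacher_logits, labels,
                             alpha=0.5, beta=0.5, temperature=3.0):
    """Sketch of Eqs. (13), (15)-(18); alpha/beta/temperature are illustrative."""
    l_base = F.cross_entropy(logits_cls, labels)                      # Eq. (13), true labels
    teacher_labels = teacher_logits.argmax(dim=1)                     # hard teacher label
    l_dist = F.cross_entropy(logits_dist, teacher_labels)             # Eq. (15)
    l_soft = F.kl_div(F.log_softmax(logits_cls / temperature, dim=1),
                      F.softmax(teacher_logits / temperature, dim=1),
                      reduction="batchmean") * temperature ** 2       # Eq. (17)
    return alpha * l_dist + (1 - alpha) * l_base + beta * l_soft      # Eq. (18)
```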
C. Importance Score Learning Module

In order to ensure that the pruned dimensions are the unimportant or even redundant parts of the model, this method introduces an importance score learning module to evaluate the importance of each dimension. The specific method is to add a fully connected layer before the layer to be pruned; the importance score of the corresponding dimension is obtained by learning the parameters of this layer. Dimensions with high scores are retained and the others are deleted. In order to prevent the importance scores from being so close together that it becomes difficult to continue the pruning work, this method sparsely trains the parameters of this layer separately by adding the L1 norm as a penalty term to the loss function. First, the loss function with the penalty term is:

J(ω; X, y) = L(ω; X, y) + α Ω(ω)    (19)

where X is the input image, y is the label corresponding to the image, ω is the parameter of the model, L is the initial loss function of the model, and Ω is the penalty term, Ω(ω) = ‖ω‖₁ = Σᵢ |ωᵢ|. Let ω* be the optimal solution of L, then:

L(ω; X; y) = L(ω*; X; y) + L′(ω*; X; y)(ω − ω*) + ½ L″(ω*; X; y)(ω − ω*)²    (20)

After deduction, we can get:

J(ω; X, y) = L(ω*; X, y) + ½ H(ω − ω*)² + α‖ω‖₁    (21)

where H is the Hessian matrix, equivalent to the second derivative of L(ω*; X; y) with respect to ω*, and the optimal solution of the loss function (21) is:

ωᵢ = sign(ωᵢ*) · max(|ωᵢ*| − α / H_{i,i}, 0)    (22)

It can be clearly seen that the solution of the loss function is sparse after the introduction of the L1 norm. After obtaining the sparse importance score learning module, we also obtain the importance score of each dimension in the pruning layer; the dimensions are then sorted according to their scores, and the dimensions with low scores are regarded as the unimportant part and dropped. The workflow of the module after sparse training is as follows.

First, let the importance scores output by the module be a. A threshold γ is obtained according to the pre-defined pruning rate; the entries of a below the threshold γ are set to zero, and the entries above the threshold γ are set to 1, to obtain the discrete a*.

Multiplying T_emb from (6) by the discrete a* gives the trimmed model input, named T_emb*:

T_emb* = Prune(T_emb)    (23)

Due to the unstructured pruning completed on T_emb, some of the weights of the fully connected layers FC_Q, FC_K and FC_V that operate on it in (7) also lose their meaning (the reason is that the essence of the fully connected layer operation is a matrix product: if some entries of the input of the fully connected layer are set to zero, it is equivalent to setting a row or a column of the matrix to zero), so the meaningless weights of the fully connected layers can be directly set to zero to obtain FC_Q*, FC_K* and FC_V*, followed by the operation:

Q* = FC_Q*(T_emb*), K* = FC_K*(T_emb*), V* = FC_V*(T_emb*)    (24)

After the three matrices Q*, K* and V* are obtained, the operations of (8) to (10) are performed to obtain Attention_output(Q*, K*, V*), and then the importance score learning module located in the MLP part of the model is used to prune FC_1 and FC_2 of (11). Denote the pruned fully connected layers by FC_1* and FC_2*. The output is then:

Z* = LN(Attention_output(Q*, K*, V*) + FC_2*(FC_1*(Attention_output(Q*, K*, V*))))    (25)

The pruned FC_1*, FC_2* in (25) represent the two fully connected layers of the MLP part of the model. Through the above operations, this method realizes the pruning of the MHSA and MLP parts.
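A minimal sketch of the importance score learning module is shown below, assuming the scores are kept as a learnable vector with one entry per embedding dimension: the L1 term of (19) is exposed as a penalty, and the thresholding that produces the discrete a* and the pruned input of (23) is implemented with a pruning-rate-derived threshold γ. All names are ours, not the authors'.

```python
import torch
import torch.nn as nn

class ImportanceScoreModule(nn.Module):
    """Sketch of the importance score learning module (Sec. III-C).
    A learnable score per embedding dimension is trained with an L1 penalty
    and later binarized into a pruning mask; names are illustrative."""
    def __init__(self, dim: int):
        super().__init__()
        self.scores = nn.Parameter(torch.ones(dim))    # importance score a for each dimension

    def l1_penalty(self) -> torch.Tensor:
        return self.scores.abs().sum()                 # Ω(ω) = ||ω||_1 in Eq. (19)

    def binarize(self, prune_ratio: float) -> torch.Tensor:
        # Threshold γ derived from the pre-defined pruning rate:
        # scores below γ become 0, scores above γ become 1 (the discrete a*).
        k = int(prune_ratio * self.scores.numel())
        gamma = torch.kthvalue(self.scores.abs(), max(k, 1)).values
        return (self.scores.abs() > gamma).float()

    def forward(self, t_emb: torch.Tensor, prune_ratio: float) -> torch.Tensor:
        mask = self.binarize(prune_ratio)
        return t_emb * mask                            # Eq. (23): T*_emb = Prune(T_emb)
```

During sparse training the continuous scores would be used directly so that the L1 gradient is meaningful; the binarization step is only applied when a pruning decision is made, consistent with the "after sparse training" workflow described above.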
D. Pruning Strategy

Considering the structural complexity of the ViT model itself and its low parameter redundancy, using the traditional training-pruning-finetuning strategy will lead to irreversible performance loss. Referring to the most advanced pruning strategies currently applied to CNN models, this method uses sparse distillation training as its strategy. Specifically, it first initializes the importance scores of the prunable dimensions of the model, then sorts the dimensions according to the scores obtained from initialization, zeroing the unimportant parameters instead of completely deleting them in order to facilitate the subsequent exploration of sparsity. Then, the distillation loss is introduced to perform distillation training on the model. After the distillation training is completed, the sparsity strategy of the sparsely trained model is explored to obtain a new sparse model. After that, the sparse model is trained by distillation again and the above operations are repeated. The specific process is as follows.

First, initialize the sparse ratio of each layer. Let W = (W^(1), ..., W^(L)) be the parameters of the ViT model, where L represents the total number of layers in the model. Each layer is initialized with a sparse ratio s = (s^(1); ...; s^(L)), where s^(l) represents the ratio of layer l:

s^(l) = ((τ_max + τ_min) / 4) · (1 + cos(t × 2π / T_end))    (26)

where τ_max represents the preset maximum sparsity rate, τ_min represents the preset minimum sparsity rate, t represents the number of iterations of the current sparse exploration and is initialized to 0, and T_end represents the total number of iterations of the sparse exploration.
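Equation (26), as reconstructed here, gives a cyclic per-layer sparse ratio; a direct transcription is sketched below with illustrative values for τ_max, τ_min and T_end.

```python
import math

def layer_sparsity(t: int, t_end: int, tau_max: float, tau_min: float) -> float:
    """Per-layer sparse ratio of Eq. (26); t is the current sparse-exploration iteration."""
    return (tau_max + tau_min) / 4.0 * (1.0 + math.cos(t * 2.0 * math.pi / t_end))

# With these (illustrative) settings the ratio starts at (tau_max + tau_min) / 2,
# decreases to 0 half-way through the exploration, and returns to the start at t = T_end.
for t in range(0, 11):
    print(t, round(layer_sparsity(t, t_end=10, tau_max=0.5, tau_min=0.1), 3))
```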
Considering that the importance score module has not yet been trained in the first iteration, the MHSA and MLP parts are first trimmed according to a gradient strategy, and the importance score formula of the MHSA part is as follows:

H^(l,h) = (A^(l,h))^T · ∂L(X^(l)) / ∂A^(l,h)    (27)

where H^(l,h) represents the importance score of the h-th attention head of the l-th layer based on the gradient strategy; the higher the score, the more important the attention head is. X^(l) represents the tokens input to the l-th layer, A^(l,h) represents the output features of the attention head, and L(.) represents the cross-entropy loss function.

The importance score formula for the MLP part is similar:

W^(l,h) = (O^(l,h))^T · ∂L(A^(l,h)) / ∂O^(l,h)    (28)

where W^(l,h) represents the importance score of the h-th dimension of the l-th layer of the MLP part based on the gradient strategy; the higher the score, the more important the dimension is. A^(l,h) denotes the output of the MHSA part of the l-th layer, which is the input of the MLP part of the l-th layer, and O^(l,h) represents the output features of the MLP part.
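The gradient-based scores of (27) and (28) resemble a Taylor-style saliency: the product of an activation with the gradient of the loss with respect to that activation, reduced to one score per head (or per MLP dimension). The reduction over batch and token axes below, and the use of the absolute value, are our assumptions.

```python
import torch

def head_importance(attn_output: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """Gradient-based head importance in the spirit of Eq. (27).

    attn_output: per-head output features A^(l,h) of shape (B, H, N, d_head),
    still attached to the autograd graph that produced `loss`.
    Returns one score per head, shape (H,)."""
    grads = torch.autograd.grad(loss, attn_output, retain_graph=True)[0]   # dL/dA^(l,h)
    scores = (attn_output * grads).sum(dim=(0, 2, 3))                      # inner product per head
    return scores.abs()
```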
After the unstructured pruning is completed, the model is updated using the distillation loss function. The model parameter update process is as follows. First, for a given input image, a series of tokens is obtained through the patch transformation and set as T_patch, and then the classification token and distillation token are added to T_patch as in (6) to form T_emb.

After getting T_emb, the model inputs T_emb to the importance score learning module, which zeros out the tokens with low importance scores and prevents them from interacting with the other tokens. Since the unimportant tokens are zeroed and do not participate in the calculation in this cycle, the corresponding MHSA and MLP parts are also equivalent to being deleted.

Denote the pruned T_emb by T_emb*. After T_emb* passes through equations (6) to (18), the loss function L_global is obtained, which is then used to update the model parameters excluding the importance score learning module:

W = W − η · ∇_W L_global    (29)

where η represents the learning rate.

In order to enforce sparsity of the importance scores, the loss function of the importance score learning module must be J in (19), and this loss function is then used to update the parameters of the importance score learning module:

W = W − η · ∇_W J(W)    (30)

After updating the model parameters, the model computes a new sparsity ratio for each layer according to (26), and performs pruning according to the importance score learning module and distillation training on the model parameters until the number of repetitions reaches the preset cycle number.
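Putting the pieces together, the alternating sparse-distillation cycle of (26), (29) and (30) could be outlined as below. This is only a schematic reading of the procedure: `model` stands for a hypothetical pruned-ViT wrapper that accepts a pruning ratio, the helper functions come from the earlier sketches, and in practice the hard mask of the importance score module would need a straight-through or soft relaxation for gradients to reach the scores.

```python
import torch

def train_one_cycle(model, score_modules, teacher, loader, t, t_end,
                    tau_max=0.5, tau_min=0.1, lr=1e-4, l1_weight=1e-4):
    """One sparse-exploration cycle (schematic; names and hyperparameters are assumptions)."""
    opt_model = torch.optim.AdamW(model.parameters(), lr=lr)
    opt_scores = torch.optim.AdamW([p for m in score_modules for p in m.parameters()], lr=lr)
    ratio = layer_sparsity(t, t_end, tau_max, tau_min)    # Eq. (26): refresh the sparse ratio

    for images, labels in loader:
        with torch.no_grad():
            teacher_logits = teacher(images)
        logits_cls, logits_dist = model(images, prune_ratio=ratio)
        l_global = global_distillation_loss(logits_cls, logits_dist, teacher_logits, labels)
        j = l_global + l1_weight * sum(m.l1_penalty() for m in score_modules)   # Eq. (19)

        opt_model.zero_grad()
        opt_scores.zero_grad()
        j.backward()          # the L1 term has zero gradient w.r.t. backbone weights,
        opt_model.step()      # so this step follows Eq. (29) for the backbone
        opt_scores.step()     # and Eq. (30) for the importance-score parameters
```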
IV. EXPERIMENTS

In order to compare with other ViT pruning methods, the dataset used in this section is ImageNet-1K. The training set and test set are divided as shown in Table I.

TABLE I
DATASET DETAILS

Dataset Name    Number of Classes    Training set    Testing set
ImageNet-1K     1000                 1281167         500000

In the experiments, the training batch size for all datasets is 16 and the optimizer is AdamW. The learning rate of the classifier is initialized to 0.001 and the learning rate of the backbone network to 0.0001. Training first uses a warm-up learning rate of 0.0001 for 30 epochs; after the 31st epoch, the initial learning rate is used and is multiplied by a factor of 0.1 every 8 epochs to continue to converge.
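The warm-up plus step-decay schedule described above could be expressed, under our reading of the text, roughly as follows; the parameter grouping and the exact epoch boundaries are assumptions.

```python
import torch

model = torch.nn.Linear(768, 1000)   # stand-in module; the real model has separate parameter groups
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # classifier rate; backbone would use 1e-4

def lr_factor(epoch, base_lr=1e-3, warmup_lr=1e-4, warmup_epochs=30, step=8, gamma=0.1):
    # 0.0001 warm-up for the first 30 epochs, the initial rate from epoch 31,
    # then multiplied by 0.1 every 8 epochs.
    if epoch < warmup_epochs:
        return warmup_lr / base_lr
    return gamma ** ((epoch - warmup_epochs) // step)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(60):
    # ... one training epoch would go here ...
    optimizer.step()      # placeholder step so the optimizer/scheduler ordering is valid
    scheduler.step()
```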
Through simulation experiments, the method proposed in this study achieves a classification accuracy of 82.19% on ImageNet-1K after pruning the model, while the ViT-B/16 model used in the experiment reaches only 77.9% on ImageNet-1K before pruning. That is to say, while pruning the model parameters to achieve model compression, the performance of this method is not only not degraded but improved by 4.29%, which surpasses most existing research methods. The experimental results and the comparison with existing research methods on the corresponding dataset are shown in Table II. The table gives the backbone network used and the classification accuracy, which are reproduced according to the officially published results of the original papers or in the same experimental environment as this experiment. Overall, the performance of the method that introduces knowledge distillation into pruning is greatly improved compared with most pruning or knowledge distillation methods. It can be seen that introducing knowledge distillation into the field of pruning is feasible and effective.

TABLE II
IMAGENET-1K DATASET CLASSIFICATION COMPREHENSIVE COMPARISON

Other Methods         Backbone Network   Accuracy (Top-1)   Params   FLOPs
Han et al. [1]        TNT/B              83.6%              65.6M    42.3B
Wang et al. [2]       PVT                81.7%              61.4M    39.6B
Li et al. [8]         T2T-ViT            82.3%              64.1M    41.3B
Hugo et al. [11]      DeiT/B             81.9%              86M      55.5B
Stéphane et al. [12]  ConViT             82.4%              86M      55.5B
Yang et al. [13]      NViT               83.1%              86M      55.5B
Zhu et al. [14]       VTP                80.7%              48M      31.0B
He et al. [15]        SPViT              81.6%              62.3M    40.2B
PDIP (ours)           ViT-B/16           82.2%              56.5M    36.4B
In order to show that changing the model pruning strategy during training does not affect the final convergence of the model, this paper also analyzes the degree of convergence when training the model. As shown in Fig. 2, although the model fluctuated slightly during the training process, the overall trend of the accuracy was a steady increase; it reached its maximum value around the 300th epoch and eventually stabilized at this value. Training the model took about 630 hours on an RTX 3090 graphics card. Although the training cost is large, the inference speed of the pruned model is significantly improved due to the reduction of FLOPs, and a large amount of memory space is saved because a large number of parameters are pruned.
V. CONCLUSION

In this paper, we have proposed a new importance score evaluation index to make sure that only the parts of the model that are not important are pruned. At the same time, we have added distillation to the input part of the model and realized the learning of teacher model knowledge by interacting with the output results of the teacher model through distillation. Finally, a process of pruning while distilling has been designed to realize the combination of pruning and knowledge distillation.

REFERENCES

[1] Han K, Xiao A, Wu E, et al. Transformer in transformer[J]. Advances in Neural Information Processing Systems, 2021, 34.
[2] Wang W, Xie E, Li X, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 568-578.
[3] Han S, Pool J, Tran J, et al. Learning both weights and connections for efficient neural network[C]//Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015: 1135-1143.
[4] Molchanov P, Mallya A, Tyree S, et al. Importance estimation for neural network pruning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 11264-11272.
[5] LeCun Y, Denker J S, Solla S A. Optimal brain damage[C]//Proceedings of the 2nd International Conference on Neural Information Processing Systems (NIPS). 1989: 598-605.
[6] Liu Z, Li J, Shen Z, et al. Learning efficient convolutional networks through network slimming[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2736-2744.
[7] He Y, Zhang X, Sun J. Channel pruning for accelerating very deep neural networks[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017.
[8] Yuan L, Chen Y, Wang T, et al. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 558-567.
[9] Hinton G E, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
[10] Abnar S, Dehghani M, Zuidema W. Transferring inductive biases through knowledge distillation[J]. arXiv preprint arXiv:2006.00555, 2020.
[11] Touvron H, Cord M, Douze M, et al. Training data-efficient image transformers & distillation through attention[C]//International Conference on Machine Learning. PMLR, 2021: 10347-10357.
[12] d'Ascoli S, Touvron H, Leavitt M L, et al. ConViT: Improving vision transformers with soft convolutional inductive biases[C]//International Conference on Machine Learning. PMLR, 2021: 2286-2296.
[13] Yang H, Yin H, Molchanov P, et al. NViT: Vision transformer compression and parameter redistribution[J]. arXiv preprint arXiv:2110.04869, 2021.
[14] Zhu M, Tang Y, Han K. Vision transformer pruning[J]. arXiv preprint arXiv:2104.08500, 2021.
[15] He H, Liu J, Pan Z, et al. Pruning self-attentions into convolutional layers in single path[J]. arXiv preprint arXiv:2111.11802, 2021.
[16] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.