Deep Model Compression Based On The Training History
S.H. Shabbeer Basha*, Mohammad Farazuddin*, Viswanath Pulabaigari*, Shiv Ram Dubey*, and Snehasis Mukherjee**
* Indian Institute of Information Technology Sri City, Chittoor, India.
** Shiv Nadar University, Gautam Budh Nagar, Uttar Pradesh, India.
Abstract— Deep Convolutional Neural Networks (DCNNs) have shown promising results in several visual recognition problems, which motivated researchers to propose popular architectures such as LeNet, AlexNet, VGGNet, ResNet, and many more. These architectures come at the cost of high computational complexity and parameter storage. To get rid of the storage and computational complexity, deep model compression methods have evolved. We propose a novel "History Based Filter Pruning (HBFP)" method that utilizes network training history for filter pruning. Specifically, we prune the redundant filters by observing similar patterns in the filters' ℓ1-norms (absolute sum of weights) over the training epochs. We iteratively prune the redundant filters of a CNN in three steps. First, we train the model and select the filter pairs with redundant filters in each pair. Next, we optimize the network to increase the similarity between the filters in a pair, which facilitates pruning one filter from each pair based on its importance without much information loss. Finally, we retrain the network to regain the performance dropped due to filter pruning. We test our approach on popular architectures such as LeNet-5 on the MNIST dataset and VGG-16, ResNet-56, and ResNet-110 on the CIFAR-10 dataset. The proposed pruning method outperforms the state-of-the-art in terms of FLOPs (floating-point operations) reduction by 97.98%, 83.42%, 78.43%, and 74.95% for the LeNet-5, VGG-16, ResNet-56, and ResNet-110 models, respectively, while maintaining a low error rate.

I. INTRODUCTION

In recent years, Convolutional Neural Networks (CNNs) have gained significant attention from researchers in the field of computer vision due to their impeccable performance in several tasks including classification and detection [25]. The wide usage of deep CNNs in numerous applications creates an increasing demand for memory (parameter storage) and computation. To address this key issue, various attempts have been made in the literature. One such attempt focuses on training the deep CNNs with limited data [8], [23], [48]. Another line of research has shown better performance in reducing the overhead of computational power and memory storage, and mainly focuses on model compression by pruning connections [9], [47] and filters [5], [15], [29].

Typically, increasing the size of a deep neural network (which requires more storage space) makes deploying the model difficult on low-end (resource-constrained) devices such as mobile devices and embedded systems. For example, VGG-16 [43] has 138.34 million parameters, which require a storage space of more than 500 MB. To reduce the resource overhead of deep CNNs, many attempts have been made to prune less-important connections and filters of a CNN, which results in architectures with a compressed design. Most of the work on deep model compression can be broadly categorized into four classes. The first class of methods [3], [9] is focused on introducing sparsity into the model parameters. The second class of methods [35], [42], [41] is aimed at quantization-based pruning. The third class of methods is dedicated to compressing the networks using filter decomposition [30], [6], [31]. The fourth class of methods is focused on pruning unimportant filters [1], [5], [50]. The proposed training History Based Filter Pruning (HBFP) method belongs to the fourth category.

In general, filter pruning methods require some metric to calculate the importance of a filter, and many such metrics have been proposed. For instance, Abbasi et al. [1] employed a brute-force technique to sequentially prune the filters that contribute less to the classification performance. However, the brute-force technique is inefficient when dealing with large neural networks such as AlexNet [22] and VGG-16 [43]. Li et al. [29] pruned the unimportant filters based on their ℓ1 norm. They assumed that the filters with a high ℓ1 norm are most probably important and will have a larger influence on the relevance of the generated feature map.

In this paper, we introduce a novel method for pruning redundant filters based on the training history. We iteratively prune redundant filters from a CNN in three stages. First, we select some (M% of) filter pairs as redundant, namely those for which the sum of the absolute difference between the filters' ℓ1 norms over the training epochs is minimum. Next, instead of pruning the filters directly, we reduce the difference between the filters' ℓ1 norms in the respective epochs (which we call optimization) to minimize the complementary information loss, and then prune one filter from each pair based on its magnitude. Finally, we fine-tune (re-train) the network to regain the classification performance, which is decreased due to filter pruning. The high-level view of the proposed method is outlined in Fig. 1.

The remaining paper is organized as follows: the related works are compiled in Section II; preliminaries and notations are presented in Section III; the proposed History Based Filter Pruning (HBFP) method is explained in Section IV; the experimental pruning results are examined in Section V along with the analysis and discussion; finally, the concluding remarks are made in Section VI along with the future directives.
Fig. 1. Initially, we start with a heavier model. After identifying the redundant filters, instead of pruning them naively at this stage, we optimize the model to minimize the complementary information loss that would occur due to filter pruning. After optimizing the network with the custom regularizer, one filter from each filter pair is pruned. Later, the network is re-trained to regain the classification performance. This process is repeated iteratively to obtain a light-weight CNN.
The proposed method does not require the support of any additional software/hardware. In brief, the contributions of this research are summarized as follows:
• We propose a novel method for pruning filters from convolutional layers based on the training history of a deep neural network. It facilitates the identification of stable, redundant filters throughout the training that can be pruned with a negligible effect on performance.
• We introduce an optimization step (custom regularizer) to reduce the information loss incurred due to filter pruning. This is achieved by increasing the redundancy level of the filters selected for pruning.
• To establish the significance of the proposed pruning method, experiments are conducted on benchmark CNNs like LeNet-5 [26], VGG-16 [43], ResNet-56, and ResNet-110 [11]. The validation of the proposed pruning method is performed over two benchmark classification datasets, MNIST and CIFAR-10.

Also, note that other pruning methods, such as low-rank approximation methods, can be integrated into our method (to decompose dense layers) to obtain a better-compressed model.

III. PRELIMINARIES AND NOTATIONS

Here, we discuss the background details, such as computing the Floating Point Operations (FLOPs) involved in a CNN, and the notations used in this paper.

A. Calculating Floating Point Operations

In order to compare various CNN models, we primarily use the accuracy on the validation set as a metric to determine which model is the most accurate. However, when there are constraints on computational resources, we use the number of Floating Point Operations (FLOPs) as a metric to compare the computational cost of the models.
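The paper's exact FLOP-counting equations are not shown in this excerpt, so the short sketch below illustrates one common counting convention for convolution and fully connected layers; the function names and the choice of counting each multiply-accumulate as a single operation are assumptions for illustration, not the authors' definition.

```python
# Illustrative FLOP counting for CNN layers (assumed convention: one
# multiply-accumulate = one FLOP; the paper's own counting equations are not shown here).

def conv_flops(h_out, w_out, c_in, k_h, k_w, c_out):
    """FLOPs of a standard convolution producing an (h_out x w_out x c_out) map."""
    macs_per_output = c_in * k_h * k_w          # multiply-accumulates per output value
    return h_out * w_out * c_out * macs_per_output

def fc_flops(n_in, n_out):
    """FLOPs of a fully connected layer with n_in inputs and n_out outputs."""
    return n_in * n_out

# Example: the first LeNet-5 convolution (20 filters of size 5x5 on a 28x28x1 input).
print(conv_flops(h_out=24, w_out=24, c_in=1, k_h=5, k_w=5, c_out=20))
```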
B. Notations

Consider a convolution layer L_i of a CNN which has n filters, i.e., {f^1, f^2, f^3, ..., f^n}. Any two filters f^i and f^j belonging to the k-th layer of a CNN are denoted as f_k^i and f_k^j, respectively. For instance, if the filter f_k^i is of dimension 3 × 3, then it consists of 9 parameters, i.e., {w_{k,1}^i, w_{k,2}^i, w_{k,3}^i, ..., w_{k,9}^i}. Here, 'k' represents the index of the convolution layer and 'i' denotes the filter index.

Initially, our method computes the ℓ1 norm of each filter within the same convolution layer using the formula given in Eq. 4. For example, the ℓ1 norm of filter f_k^i is computed as follows,

ℓ1(f_k^i) = ||f_k^i||_1 = Σ_{p=1}^{9} |w_{k,p}^i|    (4)

where p indexes the parameters of the filter f_k^i. Here, we assume the filter f_k^i is of dimension 3 × 3 and hence has 9 parameters.

Next, we compute the absolute difference between the ℓ1 norms of each filter pair at every epoch, as shown in Eq. 5. The absolute difference between ℓ1(f_k^i) and ℓ1(f_k^j) at epoch t is computed as follows,

d_{f_k^i, f_k^j}(t) = |ℓ1(f_k^i) − ℓ1(f_k^j)|.    (5)

Here, 't' indicates the epoch number. This difference is then summed over all the epochs and denoted by D_{f_k^i, f_k^j}. The sum of differences of a filter pair over the training epochs is considered as the metric for filter pruning and is given as,

D_{f_k^i, f_k^j} = Σ_{t=1}^{N} d_{f_k^i, f_k^j}(t)    (6)

where 'N' indicates the maximum number of epochs used for training the networks.
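The metric of Eqs. 4-6 can be made concrete with a minimal sketch, assuming a PyTorch implementation: at the end of every epoch we record the ℓ1 norm of each filter (Eq. 4), and after training we accumulate the absolute per-epoch differences for every filter pair (Eqs. 5 and 6). The helper names below are illustrative, not taken from the authors' code.

```python
# Sketch of Eqs. 4-6: track per-filter l1 norms over epochs and accumulate the
# pairwise absolute differences D. Helper names are illustrative assumptions.
import itertools
import torch

def filter_l1_norms(conv_weight: torch.Tensor) -> torch.Tensor:
    """Eq. 4: l1 norm of every filter in a conv layer; conv_weight has shape (n, c, kh, kw)."""
    return conv_weight.detach().abs().sum(dim=(1, 2, 3))      # shape (n,)

def pairwise_history_distance(norm_history):
    """Eqs. 5-6: norm_history is a list (one entry per epoch) of tensors of shape (n,).
    Returns a dict {(i, j): D} summing |l1(f_i) - l1(f_j)| over all N epochs."""
    n = norm_history[0].numel()
    distances = {}
    for i, j in itertools.combinations(range(n), 2):           # nC2 filter pairs
        distances[(i, j)] = sum(
            torch.abs(norms[i] - norms[j]).item() for norms in norm_history
        )
    return distances

# Usage: after each training epoch, append filter_l1_norms(layer.weight) to a list;
# once training finishes, pass that list to pairwise_history_distance.
```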
Fig. 2. The overview of the proposed pruning method. At each iteration, we compute the sum of differences D_{f_k^i, f_k^j} of all filter pairs. In the filter selection stage, we choose M% of the filter pairs for which D_{f_k^i, f_k^j} is minimum. Here, filters f_1^1 and f_1^4 have the least D_{f_k^i, f_k^j} (absolute point-to-point difference), so we select them as a pair and discard the one with the smaller magnitude. The feature maps corresponding to the pruned filters get dropped.
Our method computes the sum-of-differences value D_{f_k^i, f_k^j} for nC2 filter pairs (assuming a convolution layer has n filters). The total difference D_{f_k^i, f_k^j}, which is summed over all the epochs, is used for filter selection. More concretely, the top M% of the filter pairs (from the nC2 pairs in the same layer) with the least D_{f_k^i, f_k^j} values are formed as redundant filter pairs, which represent roughly the same information. The overview of the proposed HBFP method is presented in Fig. 2. The proposed History Based Filter Pruning (HBFP) method involves two key steps, i) filter selection and ii) optimization, which are presented in the next section.

IV. PROPOSED TRAINING HISTORY BASED FILTER PRUNING (HBFP) METHOD

Our pruning method aims at making a deep neural network computationally efficient. This is achieved by pruning the redundant filters, i.e., removing the filters whose removal does not cause much hindrance to the classification performance. We identify the redundant filters by observing similar patterns in the weights (parameters) of filters during the network training, which we refer to as the network's training history. We start with a pre-trained CNN model. During the network training, we observe and form pairs of filters whose weights follow a similar trend over the training epochs. In each iteration, we pick some (M%) of the top filter pairs with high similarity (based on the D value computed using Eq. 6; a low D value means high similarity). Instead of pruning the filters at this stage, we increase the similarity between the filters that belong to the selected filter pairs by introducing an optimization step. This optimization is achieved with a custom regularizer whose objective is to minimize the difference between the norms of the filters belonging to a filter pair at each epoch. After optimization, one filter from each pair is discarded (pruned). Which filter of a pair to prune is decided based on the criterion employed in [29], i.e., the filter with the higher ℓ1 norm is more important and is therefore retained. Finally, to recover the model from the performance drop incurred due to filter pruning, we retrain the pruned model. This process corresponds to one iteration of the proposed pruning method, which is demonstrated in Fig. 1. This whole process is repeated until the model's performance drops below a certain threshold. Our main contributions are made specifically in the filter selection and optimization steps.

A. Filter Selection

In the beginning, our method takes a heavy-weight CNN and selects the top M% of filter pairs from each convolution layer for which the difference D computed in Eq. 6 is minimum. More concretely, the filter pair having the least D_{f_k^i, f_k^j} value is formed as the first redundant filter pair. Similarly, the pair having the second least D_{f_k^i, f_k^j} value is formed as another filter pair, and so on. Likewise, in each iteration, M% of the filter pairs from each convolution layer are selected as redundant and are further considered for optimization. Let us define two more terms that are used in the proposed pruning method: "Qualified-for-pruning (Q_i)" and "Already Pruned (P_i)". Here, Q_i represents the set of filter pairs that are ready (selected) for pruning from a convolution layer L_i, whereas P_i indicates the filters that are pruned from the network (one filter from each pair of Q_i). Hence, if M% of filter pairs are chosen in Q_i, then |P_i| = M%, i.e., from each convolution layer, M% of the filters are pruned in every iteration by the proposed method.
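As a concrete illustration of the selection step, the sketch below picks the filter pairs with the smallest D values until the per-layer budget implied by M% is reached. Enforcing that each filter joins at most one selected pair is our own implementation assumption, and the function names are illustrative rather than the authors' code.

```python
# Sketch of the filter-selection step: choose the filter pairs with the smallest
# history distance D. Keeping pairs disjoint is an assumption, not a stated detail.

def select_redundant_pairs(distances: dict, num_filters: int, m_percent: float):
    """distances: {(i, j): D} from the training history; returns the selected pairs Q_i."""
    budget = max(1, int(round(num_filters * m_percent / 100.0)))   # pairs to select
    used, selected = set(), []
    for (i, j), _ in sorted(distances.items(), key=lambda kv: kv[1]):
        if i in used or j in used:
            continue                         # each filter appears in at most one pair
        selected.append((i, j))
        used.update((i, j))
        if len(selected) == budget:
            break
    return selected

# Example: with 20 filters and M = 10%, two pairs are selected, hence two filters
# are pruned from this layer in the current iteration.
```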
B. Optimization

Singh et al. [44] reported that introducing an optimization with a custom regularizer decreases the information loss incurred due to filter pruning. Motivated by this work, we add a new regularizer to the objective function to reduce the difference d_{f_k^i, f_k^j}(t) between the filters belonging to Q_i at each epoch during network training, i.e., to increase the similarity between the filters belonging to the same pair.

Let C(W) be the objective function (cross-entropy loss) of the deep convolutional neural network with W as the network parameters. To minimize the information loss and to maximize the regularization capability of the network, we employ a custom regularizer in the objective function, which is given as follows:

C1 = Σ_{f_k^i, f_k^j ∈ Q_i} exp(d_{f_k^i, f_k^j}(t))    (7)

where t denotes the epoch number and t ∈ {1, 2, 3, ..., N} (assuming we train the model for N epochs). With this new regularizer, the final objective of the proposed HBFP method is given by,

W = arg min_W (C(W) + λ · C1)    (8)

where λ is the regularization coefficient, which is a hyperparameter. Optimizing Eq. 8 decreases the difference d_{f_k^i, f_k^j} between the filter pairs that belong to the set Q_i at every epoch without affecting the model's performance much.
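A minimal PyTorch-style sketch of the regularized objective in Eqs. 7 and 8 is given below. It assumes the selected pairs Q_i are available as index tuples per convolution layer; the names `pairs_per_layer`, `pair_regularizer`, and `hbfp_loss` are illustrative and not the authors' implementation.

```python
# Sketch of Eqs. 7-8: cross-entropy loss plus the exponential pair regularizer C1.
# `pairs_per_layer` maps a conv layer to its selected pairs Q_i (illustrative names).
import torch
import torch.nn.functional as F

def pair_regularizer(pairs_per_layer) -> torch.Tensor:
    """C1 = sum over selected pairs of exp(|l1(f_i) - l1(f_j)|), as in Eq. 7."""
    c1 = 0.0
    for layer, pairs in pairs_per_layer.items():
        norms = layer.weight.abs().sum(dim=(1, 2, 3))          # l1 norm of each filter
        for i, j in pairs:
            c1 = c1 + torch.exp(torch.abs(norms[i] - norms[j]))
    return c1

def hbfp_loss(logits, targets, pairs_per_layer, lam=1.0):
    """Final objective of Eq. 8: C(W) + lambda * C1."""
    return F.cross_entropy(logits, targets) + lam * pair_regularizer(pairs_per_layer)
```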
C. Pruning and Re-training

By minimizing the difference d_{f_k^i, f_k^j} between the filters corresponding to a pair (which belongs to Q_i), we increase the similarity between the filters that belong to the same filter pair. Thereby, one filter can be pruned from each pair without affecting the model's performance much. The pruned model contains the reduced set of trainable parameters W′,

W′ = W \ {p_1, p_2, ..., p_k}    (9)

where p_1, p_2, ..., p_k are the filters that are selected for pruning after optimization. Further, we re-train the network w.r.t. the reduced parameters W′ to regain the classification performance. As we prune only the redundant filters from the network, the information loss is minimal. Therefore, re-training (fine-tuning) allows the network to recover from the loss incurred due to filter pruning.
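The structural pruning step itself can be sketched as building a smaller convolution layer that keeps only the surviving filters; removing an output filter also removes the corresponding input channel of the following convolution layer. This is a generic channel-pruning sketch under common assumptions (plain sequential connectivity, no residual branches), not the authors' released code.

```python
# Sketch of physically removing pruned filters: copy the kept output channels of a
# conv layer, and the matching input channels of the next conv layer, into smaller
# layers. Assumes the two layers are directly connected.
import torch
import torch.nn as nn

def prune_conv_pair(conv: nn.Conv2d, next_conv: nn.Conv2d, keep: list):
    """Return new (conv, next_conv) keeping only the output filters listed in `keep`."""
    idx = torch.tensor(sorted(keep))
    new_conv = nn.Conv2d(conv.in_channels, len(idx), conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[idx].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[idx].clone()

    new_next = nn.Conv2d(len(idx), next_conv.out_channels, next_conv.kernel_size,
                         next_conv.stride, next_conv.padding,
                         bias=next_conv.bias is not None)
    new_next.weight.data = next_conv.weight.data[:, idx].clone()
    if next_conv.bias is not None:
        new_next.bias.data = next_conv.bias.data.clone()
    return new_conv, new_next
```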
V. EXPERIMENTAL RESULTS

To demonstrate the significance of the proposed history based filter pruning method, we utilize four popular deep learning models: LeNet-5 [26], VGG-16 [43], ResNet-56, and ResNet-110 [11]. All the experiments are conducted on an NVIDIA GTX 1080 Titan Xp GPU. Through our experimental results, we observe that our method obtains state-of-the-art model compression results for all the above-mentioned CNNs. Similar to [44], the regularizer coefficient λ is set to 1 for LeNet-5 and ResNet-56/110. However, we empirically observe that a value of λ = 0.8 gives better results for VGG-16. We prune M% of the filters from each convolution layer simultaneously. The value of M is set to 10% for LeNet-5 and VGG-16, whereas for ResNet-56 and ResNet-110 we prune 2, 4, and 8 filters from each convolution layer corresponding to the three blocks. We repeat this pruning process until the performance drop exceeds a certain threshold, which is also a hyper-parameter. In our experiments, we set the threshold value to 1%, 2%, and 2% for the LeNet-5 [26], VGG-16 [43], and ResNet-56/110 [11] models, respectively. Next, we discuss the datasets utilized for conducting the experiments and then present a comprehensive results analysis and discussion.

A. Datasets

In this work, we use two popular benchmark image classification datasets, namely MNIST and CIFAR-10, to conduct the experiments.

1) MNIST: The MNIST dataset [27] consists of images of hand-written digits ranging from 0 to 9. This dataset has 60,000 training images with 6,000 training images per class and 10,000 test images with 1,000 test images per class. The spatial dimension of each image is 28 × 28 × 1.

2) CIFAR-10: CIFAR-10 [21] is the most widely used tiny-scale image dataset, which has images belonging to 10 object categories. The spatial dimension of each image is 32 × 32 × 3. This dataset contains a total of 60,000 images with 6,000 images per class, out of which 50,000 images (i.e., 5,000 images per class) are used for both training and fine-tuning the network and the remaining 10,000 images (i.e., 1,000 images per class) are used for validating the network performance.
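For completeness, both benchmarks are available through torchvision; the sketch below shows one way to load them with the train/test splits described above. The batch size and the bare ToTensor transform are illustrative assumptions, not settings reported by the paper.

```python
# Sketch of loading the two benchmarks with torchvision (batch size and transforms
# are assumptions; MNIST has 60k training images, CIFAR-10 has 50k).
import torch
from torchvision import datasets, transforms

def get_loaders(name="CIFAR10", root="./data", batch_size=128):
    tfm = transforms.ToTensor()
    ds = getattr(datasets, name)                 # datasets.MNIST or datasets.CIFAR10
    train = ds(root, train=True, download=True, transform=tfm)
    test = ds(root, train=False, download=True, transform=tfm)
    return (torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True),
            torch.utils.data.DataLoader(test, batch_size=batch_size))

train_loader, val_loader = get_loaders("MNIST")
```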
B. LeNet-5 on MNIST

We utilize the LeNet-5 architecture, which has two convolution layers, Conv1 and Conv2, with 20 and 50 filters of spatial dimension 5 × 5, respectively. The first convolution layer Conv1 is followed by a max-pooling layer Max_pool1 with a 2 × 2 filter, which results in a feature map of dimension 12 × 12 × 20. Similarly, the second convolution layer Conv2 is followed by another max-pooling layer Max_pool2 with a 2 × 2 filter, which results in a 4 × 4 × 50 dimensional feature map. The feature map produced by the Max_pool2 layer is flattened into an 800 × 1 dimensional feature vector, which is given as input to the first Fully Connected (FC) layer FC1. The FC1 and FC2 layers have 500 and 10 neurons, respectively. This LeNet architecture corresponds to 431,080 trainable parameters and 4.4M FLOPs.
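For reference, a compact PyTorch sketch matching the layer sizes described above is shown below. The ReLU activations are an assumption (the text does not state the activation), while the layer widths follow the description and reproduce the stated parameter count of 431,080.

```python
# A compact sketch of the LeNet-5 variant described above (20 and 50 filters of size
# 5x5, two 2x2 max-pool layers, FC layers of 500 and 10 units). ReLU is an assumption.
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5),   # 28x28x1 -> 24x24x20
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                   # -> 12x12x20
            nn.Conv2d(20, 50, kernel_size=5),  # -> 8x8x50
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                   # -> 4x4x50
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # -> 800
            nn.Linear(800, 500),               # FC1
            nn.ReLU(inplace=True),
            nn.Linear(500, num_classes),       # FC2
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(sum(p.numel() for p in LeNet5().parameters()))   # 431080
```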
We conduct the pruning experiment on LeNet-5 over the MNIST dataset using the proposed HBFP method. Training the network from scratch results in a 0.83% base error. The comparison among the benchmark pruning methods for LeNet-5 is shown in Table I. Compared to the previous pruning methods, the proposed HBFP method achieves a higher reduction in FLOPs, i.e., 97.98%, while still resulting in a lower error rate, i.e., 1.4%. The Structured Sparsity Learning (SSL) method pruned 93.42% of the FLOPs with a 1% error rate. From Table I (the rows corresponding to the proposed HBFP method), we can observe that the proposed method achieves better classification performance with a higher percentage of pruning compared to other methods. The previous work on Correlation Filter Pruning (CFP) by Singh et al. [44] reported a similar FLOPs reduction, i.e., 97.98%; however, their error rate is 1.77%, which is quite high compared to our method. Singh et al. [44] reported that employing a regularizer (in the form of an optimization) reduces the information loss that occurs due to filter pruning. Motivated by this work, we also optimize the network to increase the similarity between the filters belonging to a redundant filter pair. In this process, we reproduce the results of CFP [44], for which we obtain a top-1 error rate of 2.61% with the same percentage of FLOPs reduction (row 10 of Table I).

TABLE I
The pruning results of LeNet-5 on the MNIST dataset. The rows corresponding to HBFP-I, HBFP-II, HBFP-III, and HBFP-IV indicate the pruning results of the last four iterations of the proposed method. Here, * indicates reproduced results. The results are arranged in increasing order of pruned FLOPs reduction (%).

Method          | r1, r2 | Top-1 Error (%) | Pruned FLOPs
Baseline        | 20, 50 | 0.83            | 4.4M (0.0%)
Sparse-VD [38]  | -      | 0.75            | 2.0M (54.34%)
SBP [40]        | -      | 0.86            | 0.41M (90.47%)
SSL-3 [46]      | 3, 12  | 1.00            | 0.28M (93.42%)
HBFP-I (Ours)   | 4, 5   | 0.98            | 0.19M (95.57%)
GAL [34]        | 2, 15  | 1.01            | 0.1M (95.6%)
HBFP-II (Ours)  | 3, 5   | 1.08            | 0.15M (96.41%)
Auto balanced   | 3, 5   | 2.21            | 0.15M (96.41%)
CFP [44]        | 2, 3   | 1.77            | 0.08M (97.98%)
CFP [44]*       | 2, 3   | 2.61            | 0.08M (97.98%)
HBFP-III (Ours) | 3, 4   | 1.2             | 0.13M (96.84%)
HBFP-IV (Ours)  | 2, 3   | 1.4             | 0.08M (97.98%)

C. VGG-16 on CIFAR-10

In 2014, Simonyan et al. [43] proposed the VGG-16 CNN model, which received much attention due to its improved performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). We use the same architecture and settings as in [43] with a few modifications, such as adding a batch normalization layer [19] after every convolution layer. This VGG-16 consists of 14,982,474 trainable parameters and 313.73M FLOPs.

Fig. 3. The total number of FLOPs involved in each convolution layer before and after pruning for VGG-16 trained on CIFAR-10 using the proposed history based filter pruning approach.

TABLE II
The pruning results of VGG-16 on the CIFAR-10 dataset. The reported results are taken from the respective research articles, except * which is reproduced.

Model           | Top-1 Acc. (%) | Pruned FLOPs    | Parameters Reduction
VGG-16 [43]     | 93.96          | 313.73M (0.0%)  | 14.98M (0.0%)
ℓ1-norm [29]    | 93.40          | 206.0M (34.3%)  | 5.40M (64.0%)
GM [14]         | 93.58          | 201.1M (35.9%)  | -
VFP [51]        | 93.18          | 190.0M (39.1%)  | 3.92M (73.3%)
SSS [18]        | 93.02          | 183.13M (41.6%) | 3.93M (73.8%)
GAL [34]        | 90.73          | 171.89M (45.2%) | 3.67M (82.2%)
HBFP-I (Ours)   | 93.04          | 90.23M (71.21%) | 4.2M (71.8%)
HBFP-II (Ours)  | 92.54          | 75.05M (76.05%) | 3.5M (76.56%)
HRank [32]      | 91.23          | 73.7M (76.5%)   | 1.78M (92.0%)
HBFP-III (Ours) | 92.3           | 62.3M (80.09%)  | 2.9M (80.47%)
CFP [44]*       | 91.83          | 59.15M (81.14%) | 2.8M (81.1%)
CFP [44]        | 92.98          | 56.7M (81.93%)  | -
HBFP-IV (Ours)  | 91.99          | 51.9M (83.42%)  | 2.4M (83.77%)

Training the VGG-16 network from scratch enables the model to achieve 93.96% top-1 accuracy on the CIFAR-10 object recognition dataset. The comparison among the state-of-the-art filter pruning methods for VGG-16 on CIFAR-10 available in the literature is presented in Table II. The proposed method prunes 83.42% of the FLOPs from VGG-16, which results in 91.99% top-1 accuracy. The Geometric Median method proposed in [14] reported 93.58% top-1 accuracy with a 35.9% reduction in the FLOPs. The recent works, such as HRank [32] and Correlation Filter Pruning
TABLE III
The pruning results of ResNet-56/110 on the CIFAR-10 dataset.