
Deep Model Compression based on the Training History

S. H. Shabbeer Basha*, Mohammad Farazuddin*, Viswanath Pulabaigari*, Shiv Ram Dubey*, and Snehasis Mukherjee**

* Indian Institute of Information Technology Sri City, Chittoor, India.
** Shiv Nadar University, Gautam Budh Nagar, Uttar Pradesh, India.

Abstract— Deep Convolutional Neural Networks (DCNNs) have shown promising results in several visual recognition problems, which motivated researchers to propose popular architectures such as LeNet, AlexNet, VGGNet, ResNet, and many more. These architectures come at the cost of high computational complexity and parameter storage. To reduce this storage and computational complexity, deep model compression methods have evolved. We propose a novel "History Based Filter Pruning (HBFP)" method that utilizes the network training history for filter pruning. Specifically, we prune the redundant filters by observing similar patterns in the filters' ℓ1-norms (absolute sum of weights) over the training epochs. We iteratively prune the redundant filters of a CNN in three steps. First, we train the model and select the filter pairs with redundant filters in each pair. Next, we optimize the network to increase the similarity between the filters in a pair. This facilitates pruning one filter from each pair based on its importance without much information loss. Finally, we retrain the network to regain the performance, which is dropped due to filter pruning. We test our approach on popular architectures such as LeNet-5 on the MNIST dataset and VGG-16, ResNet-56, and ResNet-110 on the CIFAR-10 dataset. The proposed pruning method outperforms the state-of-the-art in terms of FLOPs (floating-point operations) reduction by 97.98%, 83.42%, 78.43%, and 74.95% for the LeNet-5, VGG-16, ResNet-56, and ResNet-110 models, respectively, while maintaining a lower error rate.

I. INTRODUCTION

In recent years, Convolutional Neural Networks (CNNs) have gained significant attention from researchers in the field of computer vision due to their impeccable performance in several tasks including classification and detection [25]. The wide usage of deep CNNs in numerous applications creates an increasing demand for memory (parameter storage) and computation. To address this key issue, various attempts have been made in the literature. One such attempt focuses on training deep CNNs with limited data [8], [23], [48]. Another line of research has shown better performance in reducing the overhead of computational power and memory storage, and mainly focuses on model compression by pruning connections [9], [47] and filters [5], [15], [29].

Typically, increasing the size of a deep neural network (which requires more storage space) makes deploying the model difficult on low-end (resource-constrained) devices such as mobile devices and embedded systems. For example, VGG-16 [43] has 138.34 million parameters, which require a storage space of more than 500 MB. To reduce the resource overhead of deep CNNs, many attempts have been made to prune less-important connections and filters of a CNN, which results in architectures with a compressed design. Most of the work on deep model compression can be broadly categorized into four classes. The first class of methods [3], [9] is focused on introducing sparsity in the model parameters. The second class [35], [42], [41] is aimed at quantization based pruning. The third class is dedicated to compressing the networks using filter decomposition [30], [6], [31]. The fourth class is focused on pruning unimportant filters [1], [5], [50]. The proposed training History Based Filter Pruning (HBFP) method belongs to the fourth category.

In general, filter pruning methods require some metric to calculate the importance of a filter, and many such metrics have been proposed. For instance, Abbasi et al. [1] employed a brute-force technique to sequentially prune the filters that contribute less to the classification performance. However, the brute-force technique is inefficient when dealing with large neural networks, such as AlexNet [22] and VGG-16 [43]. Li et al. [29] pruned the unimportant filters based on their ℓ1 norm. They assumed that the filters with a high ℓ1 norm are most probably important and will have a larger influence on the relevancy of the generated feature map.

In this paper, we introduce a novel method for pruning the redundant filters based on the training history. We iteratively prune redundant filters from a CNN in three stages. First, we select some (M% of) filter pairs as redundant, for which the sum of the absolute value of the difference between the filters' ℓ1 norms over the training epochs is minimum. Next, instead of pruning the filters directly, we reduce the difference between the filters' ℓ1 norms in the respective epochs (which we call optimization) to minimize the complementary information loss, and then prune one filter from each pair based on its magnitude. Finally, we fine-tune (re-train) the network to regain the classification performance, which is decreased due to filter pruning. The high-level view of the proposed method is outlined in Fig. 1.

The remaining paper is organized as follows: the related works are compiled in Section II; preliminaries and notations are presented in Section III; the proposed History Based Filter Pruning (HBFP) method is explained in Section IV; the experimental pruning results are examined in Section V along with the analysis and discussion; finally, the concluding remarks are made in Section VI along with the future directives.
Fig. 1. Initially, we start with a heavier model. After identifying the redundant filters, instead of pruning them naively at this stage, we optimize the model to minimize the complementary information loss that would occur due to filter pruning. After optimizing the network with the custom regularizer, one filter from each filter pair is pruned. Later, the network is re-trained to regain the classification performance. This process is repeated iteratively to obtain a light-weight CNN.

II. RELATED WORKS

We illustrate the efforts found in the published literature for deep model compression, separately for the four major categories of approaches mentioned in the Introduction.

A. Connection Pruning

Connection pruning methods induce sparsity in the neural network. A simple approach is to prune the connections with unimportant weights (parameters). However, this method requires quantifying the significance of the parameters. In this direction, LeCun et al. [28] and Hassibi et al. [10] utilized second-order derivative information to quantify the importance of network connections (parameters). However, computing second-order derivatives of all the connections is expensive. Chen et al. [3] used a low-cost hash function to group the weights into buckets such that the weights in the same bucket share roughly the same parameter value. Hu et al. [16] introduced a network trimming approach that iteratively prunes the zero-activation neurons. Wu et al. [47] proposed a method called BlockDrop to dynamically learn which layers to execute during inference in order to reduce the total computation time. Han et al. [9] developed a pruning technique based on the absolute value of the parameters; in [9], the parameters with an absolute value below a certain threshold are fixed to zero. These pruning methods are suggested in scenarios where the majority of the network parameters belong to Fully Connected (FC) layers. For deep models such as ResNet [11] and DenseNet [17], these kinds of pruning methods might not be suitable. However, the use of modern deep learning models for different applications is the recent trend.

B. Weight Quantization

Weight (parameter) quantization methods are used extensively in the literature for deep model compression. Han et al. [9] compressed deep CNNs by integrating connection pruning, quantization, and Huffman coding. Similarly, Tung et al. [45] combined pruning and weight quantization for deep model compression. Floating-point quantization is performed in [37] to create efficient deep neural networks. Binarization [42] is another popular quantization technique used for model compression, in which each floating-point value is mapped to a binary value. Bayesian approximation methods [35] are also used for deep model quantization. The weight quantization based methods aim to speed up the execution process by reducing the complexity of the number representation and of the arithmetic and logical operations. However, these methods require the support of special hardware to capture the benefit of network compression.

C. Filter Decomposition

As reported in [4], deep neural networks are over-parameterized, which indicates that the parameters of a layer can be recovered from a subset of the parameters belonging to the same layer. Motivated by this work, many low-rank filter decomposition methods have evolved [30], [6], [31]. Unlike filter pruning, which aims at removing the unimportant filters, these methods decrease the computational cost of the network by decomposing its filters. In this direction, Denton et al. [5] utilized the linear structure of CNNs to find a suitable low-rank approximation for the parameters while allowing minimal loss in network performance. Zhang et al. [50] made use of the subsequent non-linear units for learning low-rank filter decompositions to speed up the learning process. Lin et al. [33] introduced a low-rank decomposition method to decrease the redundant features corresponding to convolutional kernels and dense layer matrices.

D. Filter Pruning

Compared to other network pruning methods, filter pruning methods are generic and do not require the support of any special software/hardware. Due to this reason, filter pruning methods have gained popularity among researchers in recent years. In general, filter pruning methods [32], [12], [15], [36] compute the importance of filters so that unimportant filters can be pruned from the model.
In filter-pruning methods, after each iteration, retraining is required to regain the classification performance, which is dropped due to pruning the filters. Abbasi et al. [1] proposed a greedy compression scheme for filter pruning. Similarly, Li et al. [29] employed a greedy approach to prune the filters with a low filter norm. In [51], redundant channels are investigated based on the distribution of channel parameters. Ding et al. [7] proposed an auto-balanced method to transfer the representation capacity of a convolutional layer to a fraction of the filters belonging to the same layer. Other methods such as Taylor expansion [39], low-rank approximation [5], [20], [50], group-wise sparsity [24], [46], [52], [2], and many more are employed to prune the filters from deep neural networks. Very recently, Lin et al. [32] proposed a filter pruning method based on the rank of the feature maps in each layer, such that the filters contributing to low-rank feature maps can be pruned.

Most of the filter pruning methods discussed above prune the unimportant filters. However, these methods may not remove the redundant filters from the network. We propose a novel filter pruning technique that utilizes the training history to find the filters to be pruned. Moreover, in contrast to other classes of pruning methods, the proposed filter pruning method does not require the support of any additional software/hardware. In brief, the contributions of this research are summarized as follows:
• We propose a novel method for pruning filters from convolutional layers based on the training history of a deep neural network. It facilitates the identification of stable, redundant filters throughout the training that can have a negligible effect on performance after pruning.
• We introduce an optimization step (custom regularizer) to reduce the information loss incurred due to filter pruning. It is achieved by increasing the redundancy level of the filters selected for pruning.
• To establish the significance of the proposed pruning method, experiments are conducted on benchmark CNNs like LeNet-5 [26], VGG-16 [43], ResNet-56, and ResNet-110 [11]. The validation of the proposed pruning method is performed over two benchmark classification datasets, MNIST and CIFAR-10.
Also, note that other pruning methods, such as low-rank approximation methods, can be integrated into our method (to decompose dense layers) to obtain a better-compressed model.

III. PRELIMINARIES AND NOTATIONS

Here, we discuss the background details, such as computing the Floating Point Operations (FLOPs) involved in a CNN, and the notations used in this paper.

A. Calculating Floating Point Operations

In order to compare various CNN models, we primarily use the accuracy on the validation set as a metric to determine which model is the most accurate. However, when there are constraints on computational resources, we use the number of Floating Point Operations (FLOPs) as a metric to compare which model is more efficient. We use the terms "Heavy" and "Light" to represent the models with a higher and lower number of FLOPs, respectively. For a given input feature map, the number of FLOPs involved in a convolutional layer L_i, i.e., FLOP_conv(L_i), is computed as follows:

FLOP_conv(L_i) = F × F × C_in × H_out × W_out × C_out    (1)

Here, F × F is the spatial dimension of the filter, C_in is the number of input channels of the input feature map, H_out and W_out are the height and width of the output feature map, and C_out is the number of channels in the output feature map. Similarly, for a given input feature map, the number of FLOPs for a Fully Connected (FC) or dense layer D_i, i.e., FLOP_fc(D_i), is given as:

FLOP_fc(D_i) = C_in × C_out    (2)

So, for a model with K convolution layers and N fully connected layers, the total number of FLOPs is calculated as:

FLOP_total = Σ_{i=1}^{K} FLOP_conv(L_i) + Σ_{j=1}^{N} FLOP_fc(D_j)    (3)
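As a concrete illustration, the per-layer counts of Eqs. 1-3 follow directly from the layer shapes alone. The sketch below is plain Python with hypothetical shape tuples (it does not assume any particular framework); the example shapes are LeNet-5-like and only illustrative, since the absolute totals also depend on the counting convention (e.g., whether multiplies and adds are counted separately).

```python
# Sketch of Eqs. 1-3: FLOPs computed from layer shapes (hypothetical shape tuples).

def flops_conv(f, c_in, h_out, w_out, c_out):
    """Eq. 1: F * F * C_in * H_out * W_out * C_out for one convolution layer."""
    return f * f * c_in * h_out * w_out * c_out

def flops_fc(c_in, c_out):
    """Eq. 2: C_in * C_out for one fully connected (dense) layer."""
    return c_in * c_out

def total_flops(conv_layers, fc_layers):
    """Eq. 3: sum of convolutional and fully connected FLOPs.

    conv_layers: list of (F, C_in, H_out, W_out, C_out) tuples.
    fc_layers:   list of (C_in, C_out) tuples.
    """
    return (sum(flops_conv(*shape) for shape in conv_layers)
            + sum(flops_fc(*shape) for shape in fc_layers))

# Illustrative LeNet-5-like shapes: two 5x5 conv layers and two FC layers.
conv = [(5, 1, 24, 24, 20), (5, 20, 8, 8, 50)]
fc = [(800, 500), (500, 10)]
print(total_flops(conv, fc))  # raw count; the paper's totals may use a different convention
```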
B. Notations

Consider a convolution layer L_i of a CNN which has n filters, i.e., {f^1, f^2, f^3, ..., f^n}. Any two filters f^i and f^j belonging to the k-th layer of a CNN are denoted as f_k^i and f_k^j, respectively. For instance, if the filter f_k^i is of dimension 3 × 3, then it consists of 9 parameters, i.e., {w_{k,1}^i, w_{k,2}^i, w_{k,3}^i, ..., w_{k,9}^i}. Here, 'k' represents the index of the convolution layer and 'i' denotes the filter index.

Initially, our method computes the ℓ1 norm of each filter within the same convolution layer using the formula given in Eq. 4. For example, the ℓ1 norm of filter f_k^i is computed as follows:

ℓ1(f_k^i) = ||f_k^i||_1 = Σ_{p=1}^{9} |w_{k,p}^i|    (4)

where p indexes the parameters of the filter f_k^i. Here, we assume the filter f_k^i is of dimension 3 × 3.

Next, we compute the absolute difference between the ℓ1 norms of each filter pair at every epoch, as shown in Eq. 5. The absolute difference between ℓ1(f_k^i) and ℓ1(f_k^j) is computed as follows:

d_{f_k^i, f_k^j}(t) = | ℓ1(f_k^i) − ℓ1(f_k^j) |    (5)

Here, 't' indicates the epoch number. Then, this difference is summed over all the epochs and denoted by D_{f_k^i, f_k^j}. The sum of differences of a filter pair over the training epochs is considered as the metric for filter pruning and is given as:

D_{f_k^i, f_k^j} = Σ_{t=1}^{N} d_{f_k^i, f_k^j}(t)    (6)

where 'N' indicates the maximum number of epochs used for training the networks.
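A minimal sketch of how the history metric of Eqs. 4-6 could be accumulated during training is given below. It assumes PyTorch-style convolution weights of shape (out_channels, in_channels, k, k) and a hypothetical `history` list that stores the per-filter ℓ1 norms recorded once per epoch; these names are ours, not part of the paper.

```python
import itertools
import torch

def filter_l1_norms(conv_weight):
    """Eq. 4: l1 norm of every filter in one conv layer.
    conv_weight has shape (out_channels, in_channels, k, k)."""
    return conv_weight.detach().abs().sum(dim=(1, 2, 3))   # one value per filter

# Hypothetical per-epoch recording (called once at the end of every epoch):
history = []        # history[t][i] = l1 norm of filter i at epoch t
# history.append(filter_l1_norms(model.conv1.weight))      # `model.conv1` is an assumption

def pairwise_history_distance(history):
    """Eqs. 5-6: D for every filter pair of the layer, summed over the recorded epochs."""
    norms = torch.stack(history)                            # shape: (num_epochs, num_filters)
    n = norms.shape[1]
    D = {}
    for i, j in itertools.combinations(range(n), 2):        # nC2 pairs
        D[(i, j)] = (norms[:, i] - norms[:, j]).abs().sum().item()
    return D
```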
Fig. 2. The overview of the proposed pruning method. At each iteration, we compute the sum of differences D_{f_k^i, f_k^j} for all filter pairs. In the filter selection stage, we choose the M% of filter pairs for which D_{f_k^i, f_k^j} is minimum. Here, filters f_1^1 and f_1^4 have the least D_{f_k^i, f_k^j} (absolute point-to-point difference), so we select them as a pair and discard the filter of the pair with the least magnitude. The feature maps corresponding to the pruned filters get dropped.

Our method computes the sum-of-differences value D_{f_k^i, f_k^j} for nC2 filter pairs (assuming a convolution layer has n filters). The total difference D_{f_k^i, f_k^j}, which is summed over all the epochs, is used for filter selection. More concretely, the top M% of the filter pairs (from the nC2 pairs in the same layer) with the least D_{f_k^i, f_k^j} value are formed as redundant filter pairs, which represent roughly the same information. The overview of the proposed HBFP method is presented in Fig. 2. The proposed History Based Filter Pruning (HBFP) method involves two key steps: i) filter selection and ii) optimization, which are presented in the next section.

IV. PROPOSED TRAINING HISTORY BASED FILTER PRUNING (HBFP) METHOD

Our pruning method aims at making a deep neural network computationally efficient. This is achieved by pruning the redundant filters, i.e., removing the filters whose removal does not cause much hindrance to the classification performance. We identify the redundant filters by observing similar patterns in the weights (parameters) of filters during the network training, which we refer to as the network's training history. We start with a pre-trained CNN model. During the network training, we observe and form pairs of filters whose weights follow a similar trend over the training epochs. In each iteration, we pick some (M%) of the top filter pairs with high similarity (based on the D value computed using Eq. 6; a low D value means high similarity). Instead of pruning the filters at this stage, we increase the similarity between the filters that belong to the selected filter pairs by introducing an optimization step. This optimization is achieved with a custom regularizer whose objective is to minimize the difference between the filters' norms (belonging to a filter pair) at each epoch. After optimization, one filter from each pair is discarded (pruned). From a filter pair, we prune one filter based on the criterion employed in [29], i.e., the filter with a higher ℓ1 norm is more important. Finally, to recover the model from the performance drop incurred due to filter pruning, we retrain the pruned model. This process corresponds to one iteration of the proposed pruning method, which is demonstrated in Fig. 1. The whole process is repeated until the drop in the model's performance exceeds a certain threshold. Our main contributions are made specifically in the filter selection and optimization steps.
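The overall loop just described (select pairs, optimize with the regularizer, prune one filter per pair, re-train, repeat until the accuracy drop exceeds a threshold) can be summarised as a short sketch. The helper names bundled in `steps` are placeholders for the stages detailed in the following subsections, not an API defined by the paper.

```python
def hbfp_loop(model, steps, evaluate, drop_threshold=0.01):
    """One possible outer loop for HBFP (sketch only).

    `steps` is any object providing the four stage callables described in
    Sections IV-A to IV-C; their names and signatures are assumptions of
    this sketch. `evaluate` returns validation accuracy for a model.
    """
    baseline_acc = evaluate(model)
    while True:
        history = steps.record_l1_history(model)             # per-epoch l1 norms (Eq. 4)
        pairs = steps.select_pairs(history)                   # top M% pairs by D (Eqs. 5-6)
        steps.optimize_with_regularizer(model, pairs)         # Eqs. 7-8
        candidate = steps.prune_and_finetune(model, pairs)    # Eq. 9 plus re-training
        if baseline_acc - evaluate(candidate) > drop_threshold:
            return model                                      # keep the last acceptable model
        model = candidate
```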
A. Filter Selection

In the beginning, our method takes a heavy-weight CNN and selects the top M% of filter pairs from each convolution layer for which the difference D computed in Eq. 6 is minimum. More concretely, the filter pair having the least D_{f_k^i, f_k^j} value is formed as the first redundant filter pair. Similarly, the pair having the second least D_{f_k^i, f_k^j} value is formed as the next filter pair, and so on. Likewise, in each iteration, M% of the filter pairs from each convolution layer are selected as redundant and are further considered for optimization. Let us define two more terms that are used in the proposed pruning method: "Qualified-for-pruning (Q_i)" and "Already Pruned (P_i)". Here, Q_i represents the set of filter pairs that are ready (selected) for pruning from a convolution layer L_i, whereas P_i indicates the filters that are pruned from the network (one filter from each pair of Q_i). Hence, if M% of the filter pairs are chosen in Q_i, then |P_i| = M%, i.e., M% of the filters are pruned from each convolution layer in every iteration by the proposed method.
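A small sketch of this selection rule is shown below, assuming the pairwise distances D of one layer have already been computed as in Section III-B (e.g., as a dict mapping filter-index pairs to their summed differences). Treating the M% budget as a fraction of the layer's filters and keeping the selected pairs disjoint are assumptions of this sketch.

```python
def select_redundant_pairs(D, num_filters, m_fraction=0.10):
    """Pick the filter pairs of one conv layer with the smallest D (Eq. 6).

    D: dict mapping (i, j) filter-index pairs to their summed differences.
    Returns the selected set Q_i as a list of (i, j) pairs.
    """
    budget = max(1, int(round(m_fraction * num_filters)))   # assumed reading of "M%"
    chosen, used = [], set()
    for (i, j), _ in sorted(D.items(), key=lambda kv: kv[1]):  # smallest D first
        if i in used or j in used:                           # keep pairs disjoint (assumption)
            continue
        chosen.append((i, j))
        used.update((i, j))
        if len(chosen) == budget:
            break
    return chosen
```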
B. Optimization

Singh et al. [44] reported that introducing an optimization with a custom regularizer decreases the information loss incurred due to filter pruning. Motivated by this work, we add a new regularizer to the objective function to reduce the difference d_{f_k^i, f_k^j}(t) between the filters belonging to Q_i at each epoch during network training, i.e., to increase the similarity between the filters belonging to the same pair. Let C(W) be the objective function (cross-entropy loss) of the deep convolutional neural network with W as the network parameters. To minimize the information loss and to maximize the regularization capability of the network, we add a custom regularizer to the objective function, which is given as follows:

C_1 = exp( Σ_{f_k^i, f_k^j ∈ Q_i} d_{f_k^i, f_k^j}(t) )    (7)

where t denotes the epoch number and t ∈ {1, 2, 3, ..., N} (assuming we train the model for N epochs). With this new regularizer, the final objective of the proposed HBFP method is given by:

W = arg min_W ( C(W) + λ · C_1 )    (8)

where λ is the regularizer term, which is a hyperparameter. Optimizing Eq. 8 decreases the difference d_{f_k^i, f_k^j} between the filter pairs that belong to the set Q_i at every epoch without affecting the model's performance much.
set the threshold value 1%, 2%, and 2% for LeNet-5 [26],
C. Pruning and Re-training VGG-16 [43], and ResNet-56/110 [11] models, respectively.
Using the process of minimizing the difference df i ,f j Next, we discuss the datasets utilized for conducting the
k k
between the filters corresponding to a pair (which belongs to experiments and then we present a comprehensive results
Qi ), we can increase the similarity between the filters that analysis and discussion.
belong to the same filter pair. Thereby, one filter is pruned
from each pair without affecting the model’s performance A. Datasets
much. The pruned model contains the reduced number of In this work, we use two popular and benchmark image
trainable parameters W ′, classification datasets, namely MNIST and CIFAR-10, to
conduct the experiments.
W ′ = W \{p1 , p2 , ..., pk } (9)
1) MNIST: The MNIST dataset [27] consists the images
where p1 , p2 , ..pk are the filters that are selected for pruning of hand-written digits ranging from 0 to 9. This dataset has
after optimization. Further, we re-train the network w.r.t. the 60, 000 training images with 6, 000 training images per class
reduced parameters W ′ to regain the classification perfor- and 10, 000 test images with 1, 000 test images per class. The
mance. As we prune the redundant filters from the network, spatial dimension of the image is 28 × 28 × 1.
the information loss is minimum. Therefore, re-training (fine- 2) CIFAR-10: CIFAR-10 [21] is the most widely used
tuning) makes the network to recover from the loss incurred tiny-scale image dataset which has images belong to 10
due to filter pruning. object categories. The spatial dimension of the image is
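In the spirit of Eq. 9, one filter per selected pair can be physically removed by rebuilding the convolution layer without the dropped output channels; the corresponding input channels of the next layer (and any batch-normalization statistics) must be removed as well, which is omitted here. The sketch below uses plain PyTorch, keeps the filter with the higher ℓ1 norm as in [29], and is an assumption about how the pruning could be realised rather than the paper's implementation.

```python
import torch
import torch.nn as nn

def prune_filters(conv: nn.Conv2d, pairs):
    """Drop the lower-l1-norm filter of each pair and return a smaller Conv2d.

    NOTE: the next layer's input channels (and any BatchNorm) must be pruned
    to match; dilation/groups are assumed to be at their defaults in this sketch.
    """
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    drop = {i if norms[i] < norms[j] else j for i, j in pairs}   # prune the weaker filter
    keep = [idx for idx in range(conv.out_channels) if idx not in drop]

    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep])                 # surviving filters only
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep])
    return new_conv, keep
```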
V. EXPERIMENTAL RESULTS

To demonstrate the significance of the proposed history based filter pruning method, we utilize four popular deep learning models: LeNet-5 [26], VGG-16 [43], ResNet-56, and ResNet-110 [11]. All the experiments are conducted on an NVIDIA GTX 1080 Titan Xp GPU. Through our experimental results, we observe that our method obtains state-of-the-art model compression results for all the above mentioned CNNs. Similar to [44], the regularizer term λ is set to 1 for LeNet-5 and ResNet-56/110. However, we empirically observe that a value of λ of 0.8 gives better results for VGG-16. We prune M% of the filters from each convolution layer simultaneously. The value of M% is considered as 10% for LeNet-5 and VGG-16, whereas for ResNet-56 and ResNet-110 we prune 2, 4, and 8 filters from each convolution layer of the three blocks, respectively. We repeat this pruning process until the performance drop exceeds a certain threshold, which is also a hyper-parameter. In our experiments, we set the threshold value to 1%, 2%, and 2% for the LeNet-5 [26], VGG-16 [43], and ResNet-56/110 [11] models, respectively. Next, we discuss the datasets utilized for conducting the experiments, and then we present a comprehensive results analysis and discussion.
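For reference, the experiment-level hyperparameters stated in this paragraph can be collected in one place. The dictionary below is only a convenience restatement of those values; the field names are our own and not part of the paper.

```python
# Hyperparameters reported in Section V (field names are ours).
HBFP_CONFIG = {
    "LeNet-5":    {"lambda": 1.0, "prune_per_layer": "10% of filters", "max_acc_drop": 0.01},
    "VGG-16":     {"lambda": 0.8, "prune_per_layer": "10% of filters", "max_acc_drop": 0.02},
    "ResNet-56":  {"lambda": 1.0, "prune_per_layer": (2, 4, 8), "max_acc_drop": 0.02},  # filters per block
    "ResNet-110": {"lambda": 1.0, "prune_per_layer": (2, 4, 8), "max_acc_drop": 0.02},  # filters per block
}
```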
A. Datasets

In this work, we use two popular benchmark image classification datasets, namely MNIST and CIFAR-10, to conduct the experiments.

1) MNIST: The MNIST dataset [27] consists of images of hand-written digits ranging from 0 to 9. This dataset has 60,000 training images with 6,000 training images per class and 10,000 test images with 1,000 test images per class. The spatial dimension of each image is 28 × 28 × 1.

2) CIFAR-10: CIFAR-10 [21] is a widely used tiny-scale image dataset containing images belonging to 10 object categories. The spatial dimension of each image is 32 × 32 × 3. This dataset contains a total of 60,000 images with 6,000 images per class, out of which 50,000 images (i.e., 5,000 images per class) are used for both training and fine-tuning the network, and the remaining 10,000 images (i.e., 1,000 images per class) are used for validating the network performance.

TABLE I
THE PRUNING RESULTS OF LENET-5 ON MNIST DATASET. THE ROWS CORRESPONDING TO HBFP-I, HBFP-II, HBFP-III, AND HBFP-IV INDICATE THE PRUNING RESULTS OF THE LAST FOUR ITERATIONS OF THE PROPOSED METHOD. HERE, * INDICATES THE REPRODUCED RESULTS. THE RESULTS ARE ARRANGED IN INCREASING ORDER OF PRUNED FLOPS REDUCTION %.

Method | r1, r2 | Top-1% Error | Pruned FLOPs
Baseline | 20, 50 | 0.83 | 4.4M (0.0%)
Sparse-VD [38] | - | 0.75 | 2.0M (54.34%)
SBP [40] | - | 0.86 | 0.41M (90.47%)
SSL-3 [46] | 3, 12 | 1.00 | 0.28M (93.42%)
HBFP-I (Ours) | 4, 5 | 0.98 | 0.19M (95.57%)
GAL [34] | 2, 15 | 1.01 | 0.1M (95.6%)
HBFP-II (Ours) | 3, 5 | 1.08 | 0.15M (96.41%)
Auto-balanced | 3, 5 | 2.21 | 0.15M (96.41%)
CFP [44] | 2, 3 | 1.77 | 0.08M (97.98%)
CFP [44]* | 2, 3 | 2.61 | 0.08M (97.98%)
HBFP-III (Ours) | 3, 4 | 1.2 | 0.13M (96.84%)
HBFP-IV (Ours) | 2, 3 | 1.4 | 0.08M (97.98%)

B. LeNet-5 on MNIST
We utilize the LeNet-5 architecture, which has two convolution layers, Conv1 and Conv2, with 20 and 50 filters of spatial dimension 5 × 5, respectively. The first convolution layer Conv1 is followed by a max-pooling layer Max_pool1 with a 2 × 2 filter, which results in a feature map of dimension 12 × 12 × 20. Similarly, the second convolution layer Conv2 is followed by another max-pooling layer Max_pool2 with a 2 × 2 filter, which results in a 4 × 4 × 50 dimensional feature map. The feature map produced by the Max_pool2 layer is flattened into an 800 × 1 dimensional feature vector, which is given as input to the first Fully Connected (FC) layer FC1. The FC1 and FC2 layers have 500 and 10 neurons, respectively. The LeNet-5 architecture corresponds to 431,080 trainable parameters and 4.4M FLOPs.
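The LeNet-5 variant just described can be written down directly in PyTorch. The sketch below is based only on the stated layer sizes (the choice of ReLU activations is our assumption), and its parameter count matches the 431,080 figure above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    """LeNet-5 as described above: Conv(20, 5x5) -> pool -> Conv(50, 5x5) -> pool
    -> FC(800 -> 500) -> FC(500 -> 10); 431,080 trainable parameters in total."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)    # 28x28x1 -> 24x24x20
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)   # 12x12x20 -> 8x8x50
        self.fc1 = nn.Linear(4 * 4 * 50, 500)           # flattened 800-dim feature vector
        self.fc2 = nn.Linear(500, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)      # Max_pool1: 2x2
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)      # Max_pool2: 2x2
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Sanity check: sum(p.numel() for p in LeNet5().parameters()) == 431080
```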
We conduct the pruning experiment on LeNet-5 over the MNIST dataset using the proposed HBFP method. Training the network from scratch results in a 0.83% base error. The comparison among the benchmark pruning methods for LeNet-5 is shown in Table I. Compared to the previous pruning methods, the proposed HBFP method achieves a higher reduction in FLOPs, i.e., 97.98%, while still resulting in a lower error rate, i.e., 1.4%. The Structured Sparsity Learning (SSL) method pruned 93.42% of FLOPs with a 1% error rate. From Table I (the rows corresponding to the proposed HBFP method), we can examine that the proposed method achieves better classification performance with a higher percentage of pruning compared to other methods. The previous work on Correlation Filter Pruning (CFP) by Singh et al. [44] reported a similar FLOPs reduction, i.e., 97.98%; however, their error rate is 1.77%, which is quite high compared to our method. Singh et al. [44] reported that employing a regularizer (in the form of an optimization) reduces the information loss that occurs due to filter pruning. Motivated by this work, we also optimize the network to increase the similarity between the filters belonging to a redundant filter pair. In this process, we reproduce the results of CFP [44], for which we obtain a top-1 error rate of 2.61% with the same percentage of reduction in FLOPs (row 10 of Table I).

Fig. 3. The total number of FLOPs involved in each convolution layer before and after pruning for VGG-16 trained on CIFAR-10 using the proposed history based filter pruning approach.

TABLE II
THE PRUNING RESULTS OF VGG-16 ON CIFAR-10 DATASET. THE REPORTED RESULTS ARE TAKEN FROM THE RESPECTIVE RESEARCH ARTICLE, EXCEPT *, WHICH IS REPRODUCED.

Model | Top-1% | Pruned FLOPs | Parameters Reduction
VGG-16 [43] | 93.96 | 313.73M (0.0%) | 14.98M (0.0%)
ℓ1-norm [29] | 93.40 | 206.0M (34.3%) | 5.40M (64.0%)
GM [14] | 93.58 | 201.1M (35.9%) | -
VFP [51] | 93.18 | 190.0M (39.1%) | 3.92M (73.3%)
SSS [18] | 93.02 | 183.13M (41.6%) | 3.93M (73.8%)
GAL [34] | 90.73 | 171.89M (45.2%) | 3.67M (82.2%)
HBFP-I (Ours) | 93.04 | 90.23M (71.21%) | 4.2M (71.8%)
HBFP-II (Ours) | 92.54 | 75.05M (76.05%) | 3.5M (76.56%)
HRank [32] | 91.23 | 73.7M (76.5%) | 1.78M (92.0%)
HBFP-III (Ours) | 92.3 | 62.3M (80.09%) | 2.9M (80.47%)
CFP [44]* | 91.83 | 59.15M (81.14%) | 2.8M (81.1%)
CFP [44] | 92.98 | 56.7M (81.93%) | -
HBFP-IV (Ours) | 91.99 | 51.9M (83.42%) | 2.4M (83.77%)

C. VGG-16 on CIFAR-10

In 2014, Simonyan et al. [43] proposed the VGG-16 CNN model, which received much attention due to its improved performance on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The same architecture and settings as in [43] are used, with a few modifications such as a batch normalization layer [19] added after every convolution layer. VGG-16 consists of 14,982,474 trainable parameters and 313.73M FLOPs.

Training the VGG-16 network from scratch enables the model to achieve 93.96% top-1 accuracy on the CIFAR-10 object recognition dataset. The comparison among the state-of-the-art filter pruning methods for VGG-16 on CIFAR-10 available in the literature is performed in Table II. The proposed method prunes 83.42% of FLOPs from VGG-16, which results in 91.99% top-1 accuracy. The Geometric Median method proposed in [14] reported 93.58% top-1 accuracy with a 35.9% reduction in FLOPs.
Recent works, such as HRank [32] and Correlation Filter Pruning (CFP) [44], are able to prune 76.5% and 81.93% of FLOPs with error rates of 8.77% and 7.02%, respectively, while the proposed HBFP method is able to prune 83.42% of FLOPs with an error rate of 8.01%. The detailed comparison of pruning results for VGG-16 on the CIFAR-10 dataset is demonstrated in Table II. The FLOPs in each convolution layer before and after employing the proposed pruning method on VGG-16 over CIFAR-10 are illustrated in Fig. 3.

TABLE III
THE PRUNING RESULTS OF RESNET-56/110 ON CIFAR-10 DATASET.

Model | Top-1% | Pruned FLOPs | Parameters Reduction
ResNet-56 [11] | 93.26 | 125.49M (0.0%) | 0.85M (0.0%)
VFP [51] | 92.26 | 96.6M (20.3%) | 0.67M (20.49%)
ℓ1-norm [29] | 93.06 | 90.9M (27.6%) | 0.73M (14.1%)
NISP [49] | 93.01 | 81.0M (35.5%) | 0.49M (42.4%)
HBFP-I (Ours) | 92.42 | 70.81M (43.68%) | 0.48M (46.38%)
AMC [13] | 91.9 | 62.74M (50%) | -
CP [15] | 91.8 | 62.74M (50%) | -
GAL [34] | 90.36 | 49.99M (60.2%) | 0.29M (65.9%)
HBFP-II (Ours) | 92.25 | 49.22M (60.85%) | 0.33M (60.85%)
HRank [32] | 90.72 | 32.52M (74.1%) | 0.27M (68.1%)
HBFP-III (Ours) | 91.79 | 31.54M (74.91%) | 0.21M (74.9%)
CFP [44]* | 91.37 | 31.54M (74.91%) | 0.21M (74.9%)
CFP [44] | 92.63 | 29.5M (76.59%) | 3.4M (77.14%)
HBFP-IV (Ours) | 91.42 | 27.1M (78.43%) | 0.19M (76.97%)
ResNet-110 [11] | 93.5 | 252.89M (0.0%) | 1.72M (0.0%)
VFP [51] | 92.96 | 160.7M (36.44%) | 1.01M (41.27%)
ℓ1-norm [29] | 93.3 | 155.0M (38.7%) | 1.16M (32.6%)
GAL [34] | 92.55 | 130.2M (48.5%) | 0.95M (44.8%)
HBFP-I (Ours) | 93.01 | 119.7M (52.69%) | 0.81M (52.66%)
HBFP-II (Ours) | 92.91 | 98.9M (60.89%) | 0.67M (41.27%)
HBFP-III (Ours) | 92.83 | 80.2M (68.31%) | 0.54M (68.28%)
HRank [32] | 92.65 | 79.3M (68.6%) | 0.53M (68.7%)
HBFP-IV (Ours) | 91.96 | 63.3M (74.95%) | 0.43M (74.92%)

D. ResNet-56/110 on CIFAR-10

We also use the deeper and more complex CNN models ResNet-56 and ResNet-110 [11] to conduct the pruning experiments over the CIFAR-10 dataset using the proposed HBFP method. ResNet-56/110 have three blocks of convolutional layers with 16, 32, and 64 filters. Training these residual models (i.e., ResNet-56 and ResNet-110) with the same parameters as in [11] produces 93.26% and 93.5% top-1 accuracies, respectively. In each iteration, the HBFP method prunes 2 filters from the first block, which has 16 filters, 4 filters from the second block, which has 32 filters, and 8 filters from the third block, which has 64 filters. From Table III, it is evident that the proposed pruning method produces state-of-the-art compression results for ResNet-56 and ResNet-110 on the CIFAR-10 dataset.

ResNet-56: As depicted in Table III, both the AMC [13] and CP [15] methods reduce 50.00% of FLOPs while resulting in 8.1% and 8.2% error, respectively. HRank [32] prunes 74.1% of FLOPs with 9.28% error. From Table III, it is clear that the proposed HBFP obtains a top-1 accuracy of 91.42% with a higher reduction of FLOPs (78.43%) compared to HRank [32]. The proposed method also obtains this high FLOPs reduction with comparable performance (91.42%) to CFP [44]. Moreover, we achieve better performance using the proposed HBFP as compared to the reproduced results of CFP [44].

ResNet-110: As per the results summarized in the lower part of Table III, the filter ℓ1-norm based pruning method obtained 93.3% top-1 accuracy by pruning 38.7% of the FLOPs. The recent HRank [32] method achieved 92.65% top-1 performance with a 68.6% FLOPs reduction. From Table III, we can note that our method achieves a 74.95% FLOPs reduction by removing 74.92% of the trainable parameters, with a minimal loss of 1.54% as compared to the baseline, using ResNet-110 on CIFAR-10. Moreover, our HBFP-III performs better than HRank [32] in terms of accuracy with a comparable FLOPs reduction.

Fig. 4. Illustrating the effect of the pruning results obtained using the proposed pruning method with and without employing optimization. We can observe that introducing a custom regularizer to the objective increases the model's performance.

E. Effect of Regularizer

To investigate the effect of the optimization step, we also conduct the experiments without employing the custom regularizer. We show the effect of the regularizer (optimization step) by comparing the classification results obtained for VGG-16 on CIFAR-10 using the HBFP method with and without employing the regularizer. From Fig. 4, it can be observed that increasing the similarity between the filters belonging to a redundant filter pair using the regularizer, and thereby training the network, decreases the information loss that occurs due to filter pruning. The reason for the minimal information loss is that the optimization step increases the redundancy level among the filters of a pair such that the removal of one filter does not affect the performance. We also perform similar experiments using other CNNs; the corresponding results are reported in the Supplementary material.

VI. CONCLUSION

We propose a new filter pruning technique which uses the filters' information at every epoch during network training. The proposed History Based Filter Pruning (HBFP) method is able to prune a higher percentage of convolution filters compared with state-of-the-art pruning methods.
At the same time, HBFP pruning produces a lower error rate. Eventually, it reduces the FLOPs of the LeNet-5 (97.98%), VGG-16 (83.42%), ResNet-56 (78.43%), and ResNet-110 (74.95%) models. The main finding of this paper is to prune the filters that exhibit similar behavior throughout the network training, as the removal of one such filter from a filter pair does not affect the model's performance greatly. According to our study, employing the custom regularizer in the objective function also improves the classification results. We show the importance of the proposed pruning strategy through experiments, including LeNet-5 on the MNIST dataset and VGG-16/ResNet-56/ResNet-110 on the CIFAR-10 dataset. One possible direction of future research is pruning the filters further by considering the similarity among the filters from different layers.

ACKNOWLEDGEMENTS

We acknowledge the support of NVIDIA for providing the GTX 1080 Titan Xp GPU, which is used for carrying out the experiments of this research.
REFERENCES

[1] Reza Abbasi-Asl and Bin Yu. Structural compression of convolutional neural networks based on greedy filter pruning. arXiv preprint arXiv:1705.07356, 2017.
[2] Jose M Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.
[3] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.
[4] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando De Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
[5] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[6] Xiaohan Ding, Guiguang Ding, Yuchen Guo, and Jungong Han. Centripetal SGD for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4943–4953, 2019.
[7] Xiaohan Ding, Guiguang Ding, Jungong Han, and Sheng Tang. Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 9516–9527, 2018.
[9] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[10] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164–171, 1993.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018.
[13] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
[14] Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340–4349, 2019.
[15] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
[16] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[18] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 304–320, 2018.
[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[21] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[23] Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4281–4289, 2018.
[24] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564, 2016.
[25] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[26] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[27] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[28] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[29] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[30] Yawei Li, Shuhang Gu, Luc Van Gool, and Radu Timofte. Learning filter basis for convolutional neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5623–5632, 2019.
[31] Yawei Li, Shuhang Gu, Christoph Mayer, Luc Van Gool, and Radu Timofte. Group sparsity: The hinge between filter pruning and decomposition for network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2020.
[32] Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. HRank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1529–1538, 2020.
[33] Shaohui Lin, Rongrong Ji, Chao Chen, Dacheng Tao, and Jiebo Luo. Holistic CNN compression via low-rank decomposition with knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):2889–2905, 2018.
[34] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David Doermann. Towards optimal structured CNN pruning via generative adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2799, 2019.
[35] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.
[36] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
[37] Hui Miao, Ang Li, Larry S Davis, and Amol Deshpande. Towards unified data and lifecycle management for deep learning. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 571–582. IEEE, 2017.
[38] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2498–2507. JMLR.org, 2017.
[39] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
[40] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, pages 6775–6784, 2017.
[41] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
[42] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[44] Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay Namboodiri. Leveraging filter correlations for deep model compression. In The IEEE Winter Conference on Applications of Computer Vision, pages 835–844, 2020.
[45] Frederick Tung and Greg Mori. CLIP-Q: Deep network compression learning by in-parallel pruning-quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7873–7882, 2018.
[46] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
[47] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. BlockDrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8817–8826, 2018.
[48] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 7332–7342, 2018.
[49] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, 2018.
[50] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984–1992, 2015.
[51] Chenglong Zhao, Bingbing Ni, Jian Zhang, Qiwei Zhao, Wenjun Zhang, and Qi Tian. Variational convolutional neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2780–2789, 2019.
[52] Hao Zhou, Jose M Alvarez, and Fatih Porikli. Less is more: Towards compact CNNs. In European Conference on Computer Vision, pages 662–677. Springer, 2016.

You might also like