
Dependency Aware Filter Pruning


Kai Zhao, Xin-Yu Zhang, Qi Han, and Ming-Ming Cheng

Abstract—Convolutional neural networks (CNNs) are typically over-parameterized, bringing considerable computational overhead and memory footprint in inference. Pruning a proportion of unimportant filters is an efficient way to mitigate the inference cost. For this purpose, identifying unimportant convolutional filters is the key to effective filter pruning. Previous work prunes filters according to either their weight norms or the corresponding batch-norm scaling factors, while neglecting the sequential dependency between adjacent layers. In this paper, we further develop the norm-based importance estimation by taking the dependency between the adjacent layers into consideration. Besides, we propose a novel mechanism to dynamically control the sparsity-inducing regularization so as to achieve the desired sparsity. In this way, we can identify unimportant filters and search for the optimal network architecture within certain resource budgets in a more principled manner. Comprehensive experimental results demonstrate that the proposed method performs favorably against the existing strong baseline on the CIFAR, SVHN, and ImageNet datasets. The training sources will be publicly available after the review process.

Index Terms—Deep Learning, Network Compression, Filter Pruning.

arXiv:2005.02634v1 [cs.CV] 6 May 2020
The first three students make equal contributions to this paper. K. Zhao, X.-Y. Zhang, Q. Han, and M.-M. Cheng are with TKLNDST, CS, Nankai University, Tianjin, China.

I. INTRODUCTION

Convolutional neural networks (CNNs) have achieved remarkable performance on a wide range of vision and learning tasks [1]–[11]. Despite the impressive performance, CNNs are notably over-parameterized and thus lead to high computational overhead and memory footprint in inference. Therefore, network compression techniques are developed to assist the deployment of CNNs in real-world applications.

Filter pruning is an efficient way to reduce the computational cost of CNNs with negligible performance degradation. As shown in Fig. 1, a typical pipeline of filter pruning [12] works as follows: 1) train an over-parameterized model with the sparsity-inducing regularization; 2) estimate the importance of each filter and prune the unimportant filters; 3) finetune the compressed model to recover the accuracy. Among these, identifying unimportant filters is the key to efficient filter pruning. Prior work [12]–[15] prunes filters according to the magnitude of the corresponding model parameters. For example, Li et al. [14] prune convolutional filters of smaller L1 norms, as they are considered to have less impact on the functionality of the network. Network Slimming [12] then proposes to prune channels (i.e., filters) based on the corresponding scaling factors. To be specific, the scaling factors of the batch normalization (BN) [16] layer serve as an indicator of the channel importance, on which an L1 regularization is imposed to promote sparsity. As a result, Liu et al. [12] derive an automatically searched network architecture of the compressed model.

Fig. 1. A typical pipeline of filter pruning: (a) train an over-parameterized model with sparsity-inducing regularization; (b) prune unimportant filters based on certain criteria; (c) finetune the compressed model till convergence.

However, existing methods select unimportant filters based only on the parameter magnitude of a single layer [12]–[14], [17]–[19], while neglecting the dependency between consecutive layers. For example, a specific channel with a small BN scaling factor may be followed by a convolution with a large weight magnitude at that channel, making the channel still important to the output. Besides, in the "smaller BN factor, less importance" strategy, BN factors from different layers are gathered together to rank and determine the filters to be pruned. We argue and empirically verify that this strategy is sub-optimal and may lead to unstable network architectures, as it neglects the intrinsic statistical variation among the BN factors of different layers. Empirically, we observe that the pruned architectures of Network Slimming [12] are sometimes unbalanced and lead to severely degraded performance, especially when the pruning ratio is relatively high.

In this paper, we propose a dependency-aware filter pruning strategy, which takes the relationship between adjacent layers into consideration. Hence, we measure the filter importance in a more principled manner. Along this line, we introduce a novel criterion to determine the filters to be pruned by the local importance of the two consecutive layers. That is, if one layer is sparse, then more filters will be pruned, and vice versa, regardless of the statistics of other layers. Finally, we propose an automatic-regularization-control mechanism in which the coefficient of the sparsity-inducing regularization is dynamically adjusted to meet the desired sparsity. Our contributions are summarized below:

• We propose a principled criterion for measuring the filter importance by taking the dependency between adjacent layers into consideration.
• Given the dependency-aware filter importance, we prune filters based on the local statistics of each layer, instead of ranking the filter importance across the entire network.
• We propose to dynamically control the coefficient of the sparsity-inducing regularization to achieve the desired model sparsity.

Comprehensive experimental results demonstrate that the improved filter pruning strategy performs favorably against the existing strong baseline [12] on the CIFAR, SVHN, and ImageNet datasets. We also validate our design choices with several ablation studies and verify that the proposed algorithm reaches more stable and well-performing architectures.

II. RELATED WORK

A. Network Pruning

Network pruning is a prevalent technique to reduce redundancy in deep neural networks by removing unimportant neurons. Specifically, weight pruning approaches [20]–[26] remove network parameters without structural constraints, thus leading to unstructured architectures that are not well supported by the BLAS libraries. On the other hand, filter pruning methods [12], [14], [27]–[30] remove entire filters (i.e., channels) from each layer, thus resulting in compact networks that can be conveniently incorporated into modern BLAS libraries. According to how the unimportant filters are identified, existing filter pruning methods can be further divided into two categories: data-dependent filter pruning and data-independent filter pruning.

Data-dependent filter pruning utilizes the training data to determine the filters to be pruned. Polyak et al. [31] remove filters that produce activations of smaller norms. He et al. [28] perform a channel selection by minimizing the reconstruction error. Zheng et al. [32] and Anwar et al. [33] both evaluate the filter importance via the loss of the validation accuracy without each filter. Molchanov et al. [29] approximate the exact contribution of each filter with the Taylor expansion. A recent work [34] proposes a layer-wise recursive Bayesian pruning method with a dropout-based metric of redundancy.

Data-independent filter pruning identifies less important filters based merely on the model itself (i.e., the model structure and model parameters). Li et al. [14] discard filters according to the L1 norm of the corresponding parameters, as filters with smaller weights are considered to contribute less to the output. Network Slimming [12] imposes a sparsity-inducing regularization on the scaling factors of the BN layer and then prunes filters with smaller scaling factors. Zhou et al. [35] use an evolutionary algorithm to search for redundant filters during training. He et al. [18] propose to dynamically prune filters during training. In another work, He et al. [27] propose to prune filters that are close to the geometric median. They argue that filters near the geometric median are more likely to be represented by others [36], thus leading to redundancy.

Our method belongs to data-independent filter pruning, which is generally more efficient, as involving the training data brings extra computation. For example, Zheng et al. [32] and Anwar et al. [33] measure the importance of each filter by removing the filter and re-evaluating the compressed model on the validation set. This procedure is extremely time-consuming. Essentially, we take the dependency between consecutive layers into consideration, while previous data-independent methods [12], [14], [27] merely focus on the parameters (either the convolutional weights [14], [27] or the BN scaling factors [12]) of a single layer. Besides, we propose a novel mechanism to dynamically control the coefficient of the sparsity-inducing regularization, instead of pre-defining it based on human heuristics [12]. Incorporating these components, our principled approach can better estimate the filter importance (Sec. V-A) and achieve more balanced pruned architectures (Sec. V-D).

B. Neural Architecture Search

While most state-of-the-art CNNs [37]–[39] are manually designed by human experts, there is also a line of research that explores automatic network architecture learning [40]–[46], called neural architecture search (NAS). Specifically, automatically tuning the channel width is also studied in NAS. For example, ChamNet [41] builds an accuracy predictor based on Gaussian Processes with Bayesian optimization to predict the network accuracy under various channel widths in each layer. FBNet [45] adopts a gradient-based method to optimize the CNN architecture and search for the optimal channel width. The proposed pruning method can be regarded as a particular case of channel width selection as well, except that we impose the resource constraints on the selected architecture. However, our method learns the architecture through a single training process, while typical NAS methods may train hundreds of models with different architectures to determine the best-performing one [41], [46]. We highlight that our efficiency is in line with the goal of neural architecture search.

C. Other Alternatives for Network Compression

a) Low-Rank Decomposition: There is a line of research [47]–[50] that aims to approximate the weight matrices of the neural networks with several low-rank matrices using techniques like the Singular Value Decomposition (SVD) [47]. However, these methods cannot be applied to the convolutional weights, and thus the acceleration in inference is limited.

b) Weight Quantization: Weight quantization [51]–[55] reduces the model size by using low bit-width representations of the weights and hidden activations. For example, Courbariaux et al. [52] and Rastegari et al. [53] quantize the real-valued weights into binary or ternary ones, i.e., the weight values are restricted to {−1, 1} or {−1, 0, 1}. Cheng et al. [55] quantize CNNs with a predefined codebook. Despite the significant model-size reduction and inference acceleration, these methods often come with a mild accuracy drop due to the low precision.

III. DEPENDENCY-AWARE FILTER PRUNING

A. Dependency Analysis

Generally, we assume a typical CNN involves multiple convolution operators (Conv layers), batch normalizations (BN layers) [16], and non-linearities, which are applied to the input signals sequentially, as in Fig. 2. Practically, each channel is transformed independently in the BN layers and non-linearities, while inter-channel information is fused in the Conv layers. To prune filters (i.e., channels) with minimal impact on
the network output, we analyze the role each channel plays in the Conv layers as follows.

Fig. 2. Illustration of the "batch normalization, convolution" sequential structure. The input feature X^l (after whitening) is stretched by the scaling factors γ^l and convolved with three convolutional kernels, resulting in a three-channel output feature. For the dependency-aware criterion, the importance of the c-th feature map is estimated by the product of the absolute value of the scaling factor |γ_c^l| and the magnitude of the corresponding convolutional kernel ||W̃_c^{l+1}||. (See Eq. (7).)

Let X^l ∈ R^{C^l × H^l × W^l} be the hidden activations after normalization and before scaling in the l-th BN layer. The scaled activations Y^l can be formulated as*

    Y_c^l = γ_c^l X_c^l,    (1)

where γ^l ∈ R^{C^l} denotes the scaling factor of the l-th BN layer and X_c^l (resp. Y_c^l) is the c-th channel of X^l (resp. Y^l). Then, a Lipschitz-continuous non-linearity σ is applied to Y^l, namely,

    Z^l = σ(Y^l).    (2)

Afterward, all channels of Z^l are fused into F^{l+1} ∈ R^{C^{l+1} × H^{l+1} × W^{l+1}} via a convolution operation, and different channels contribute to the fused activation F^{l+1} differently. Formally, let W^{l+1} ∈ R^{C^{l+1} × C^l × k × k} be the (l+1)-th convolution filter, where k denotes the kernel size. We have

    F^{l+1} = W^{l+1} ⊛ Z^l,    (3)

where ⊛ denotes the convolution operator. As convolution is an affine transformation, we re-formulate the linearity of Eq. (3) explicitly:

    F̃^{l+1} = W̃^{l+1} Z̃^l,    (4)

where F̃^{l+1} ∈ R^{C^{l+1} × H^{l+1}W^{l+1}}, W̃^{l+1} ∈ R^{C^{l+1} × k²C^l}, and Z̃^l ∈ R^{k²C^l × H^{l+1}W^{l+1}} are the unfolded versions of F^{l+1}, W^{l+1}, and Z^l, respectively. Factorize F̃^{l+1} along the channel axis, and we have

    F̃^{l+1} = Σ_{c=1}^{C^l} W̃_c^{l+1} Z̃_c^l,    (5)

where W̃_c^{l+1} ∈ R^{C^{l+1} × k²} and Z̃_c^l ∈ R^{k² × H^{l+1}W^{l+1}}. Then, we analyze the contribution of each channel as follows:†

    ||F̃^{l+1}|| ≤ Σ_{c=1}^{C^l} ||W̃_c^{l+1} Z̃_c^l|| ≤ Σ_{c=1}^{C^l} ||W̃_c^{l+1}|| · ||Z̃_c^l||
               ≤ Σ_{c=1}^{C^l} ||W̃_c^{l+1}|| · L ||Ỹ_c^l||
               = L Σ_{c=1}^{C^l} |γ_c^l| · ||W̃_c^{l+1}|| · ||X̃_c^l||,    (6)

where L denotes the Lipschitz constant of the function σ, and X̃^l and Ỹ^l are the unfolded versions of X^l and Y^l, respectively. Since the normalization operation in the BN layer uniformizes the activations X_c^l (i.e., X̃_c^l) across channels, we quantify the contribution of the c-th channel by

    S_c^l = |γ_c^l| · ||W̃_c^{l+1}||,    (7)

which serves as our metric for network pruning.

* For simplicity, we omit the shifting parameters in a typical BN layer, and the bias term in Eq. (3).
† Here, we assume the non-linearity provides zero activations given zero inputs, and most widely-used non-linearities, such as ReLU [56] and its variants [57]–[59], satisfy this property.
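To make the criterion concrete, the importance score of Eq. (7) can be computed with a few lines of PyTorch. The sketch below is our illustration rather than the authors' released code: it assumes an ungrouped Conv2d following the BN layer and uses the Frobenius norm for ||W̃_c^{l+1}||, since the paper does not pin down the norm type.

import torch
import torch.nn as nn

def dependency_aware_importance(bn: nn.BatchNorm2d, next_conv: nn.Conv2d) -> torch.Tensor:
    """Eq. (7): S_c = |gamma_c| * ||W~_c^{l+1}|| for every channel c of `bn`,
    where the norm is taken over the slice of `next_conv` that reads channel c."""
    gamma = bn.weight.detach().abs()                              # |gamma_c|, shape [C_l]
    w = next_conv.weight.detach()                                 # shape [C_{l+1}, C_l, k, k]
    w_norm = w.permute(1, 0, 2, 3).flatten(1).norm(p=2, dim=1)    # ||W~_c^{l+1}||, shape [C_l]
    return gamma * w_norm                                         # S^l, shape [C_l]

In the setting of Fig. 2, `bn` would be the l-th BN layer and `next_conv` the (l+1)-th convolution that consumes its output.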
B. Filter Selection

Let r ∈ (0, 1) be the pruning ratio, and C^l (l ∈ {1, 2, ..., L}) be the number of filters in the l-th convolutional layer. Generally, previous works can be divided into two groups according to the target network.

a) Pruning with a Pre-defined Target Network: Many previous works [17], [18], [27] prune a fixed ratio of filters in each layer. In other words, there will be r·C^l filters pruned from the l-th layer. The architecture of the target network is known even without pruning. However, recent work [13], [60] reveals that this strategy cannot find the optimal distribution of the neuron numbers of each convolutional layer across the network, as some layers will be over-parameterized while others are under-parameterized.

b) Pruning as Architecture Search: Network Slimming [12] treats pruning as a special form of architecture search, i.e., a search for the optimal channel width of each layer. It compares the importance of each convolutional filter across the entire network and prunes filters of less importance. This approach provides more flexibility for the compressed architecture, as a higher pruning ratio can be achieved if a specific layer is sparse and vice versa.
However, according to our practice, we find that sometimes too many filters of a layer (or occasionally all filters of a layer) are pruned in this strategy, leading to severely degraded performance. This is because it does not take the intrinsic statistical variation among different layers into consideration. Suppose there are two layers and the corresponding scaling factors are {0.10, 0.01, 0.03, 0.15} and {1, 100, 2, 200}, respectively. Our target is to prune half of the filters, i.e., r = 0.5. Apparently, the second and third channels should be pruned from the first layer, and the first and third channels should be pruned from the second layer. However, if we rank the scaling factors globally, all filters of the first layer will be pruned, which is obviously unreasonable.

To alleviate this issue, we instead select the unimportant filters based on the intra-layer statistics. Let S_c^l be the importance of the c-th channel in the l-th layer. Then, filters with importance factor S_c^l ≤ max(S^l) · p will be pruned, where the threshold p ∈ (0, 1) is a hyper-parameter. Formally, the set of filters to be pruned in the l-th layer is:

    F_pruned^l = {c : S_c^l ≤ max(S^l) · p}.    (8)

In our solution, the choice of the filters to be pruned in one layer is made independent of the statistics of other layers, so that the intrinsic statistical differences among layers will not result in a dramatically unbalanced neural architecture.
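The selection rule of Eq. (8) reduces to a per-layer threshold. A minimal sketch under the same assumptions as before (the function name is ours):

import torch

def filters_to_prune(importance: torch.Tensor, p: float = 0.01) -> torch.Tensor:
    """Eq. (8): indices c with S_c <= max(S^l) * p. The decision uses only the
    statistics of this layer, never a global ranking across the network."""
    threshold = importance.max() * p
    return torch.nonzero(importance <= threshold, as_tuple=False).flatten()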
C. Automatic Control of Sparsity Regularization

Network Slimming [12] imposes an L1 regularization on the model parameters to promote model sparsity. However, choosing a proper regularization coefficient λ is non-trivial and mostly requires manual tuning based on human heuristics. For example, Network Slimming performs a grid search in a set of candidate coefficients for each dataset and network architecture. However, different pruning ratios require different levels of model sparsity, and thus different coefficients λ. It is extremely inefficient to tune λ for each experimental setting.

To escape from manually choosing λ and meet the required model sparsity at the same time, we propose to automatically control the regularization coefficient λ. Following the practice in [12], an L1 regularization is imposed on the scaling factors of the batch normalization layers. As shown in Alg. 1, at the end of the t-th epoch, we calculate the overall sparsity of the model:

    P = Σ_l |F_pruned^l| / Σ_l C^l.    (9)

Given the total number of epochs N, we compute the expected sparsity gain, and if the sparsity gain within an epoch does not meet the requirement, i.e., P_t − P_{t−1} < (r − P_{t−1}) / (N − t + 1), the regularization coefficient λ is increased by Δλ. If the model is over-sparse, i.e., P_t > r, the coefficient λ is decreased by Δλ. This strategy guarantees that the model meets the desired model sparsity, and that the pruned filters contribute negligibly to the outputs.

Algorithm 1: Automatic Regularization Control
    Initialize λ_1 = 0, P_1 = 0, N = #epochs
    for t := 1 to N do
        train for 1 epoch
        P_t = Σ_l |F_pruned^l| / Σ_l C^l
        if P_t − P_{t−1} < (r − P_{t−1}) / (N − t + 1) then
            λ_{t+1} = λ_t + Δλ
        else if P_t > r then
            λ_{t+1} = λ_t − Δλ
    end
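A rough sketch of Alg. 1 in Python is given below. It is our paraphrase, not the released training script: `train_one_epoch` and `layer_importances` are hypothetical helpers, and the L1 penalty on the BN scaling factors is assumed to be applied in the usual Network Slimming fashion, i.e., by adding λ·sign(γ) to the gradients of the BN weights before each optimizer step.

def model_sparsity(importances, p=0.01):
    """Eq. (9): fraction of filters that Eq. (8) would prune, summed over all layers."""
    pruned = sum(int((s <= s.max() * p).sum()) for s in importances)
    total = sum(s.numel() for s in importances)
    return pruned / total

def train_with_auto_regularization(model, train_one_epoch, layer_importances,
                                   num_epochs, r, delta_lambda=1e-5, p=0.01):
    lam, prev_sparsity = 0.0, 0.0                          # lambda_1 = 0, P_1 = 0
    for t in range(1, num_epochs + 1):
        train_one_epoch(model, l1_coeff=lam)               # assumed to add lam * sign(gamma) to BN-weight grads
        sparsity = model_sparsity(layer_importances(model), p)
        if sparsity - prev_sparsity < (r - prev_sparsity) / (num_epochs - t + 1):
            lam += delta_lambda                            # sparsity grows too slowly: strengthen the penalty
        elif sparsity > r:
            lam -= delta_lambda                            # model is over-sparse: relax the penalty
        prev_sparsity = sparsity
    return model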
IV. EXPERIMENTAL RESULTS

In this section, we first describe the details of our implementation in Sec. IV-A, and report the experimental results on the CIFAR [61] datasets in Sec. IV-B and the ImageNet [62] dataset in Sec. IV-D.

A. Implementation Details

Our implementation is based on the official training sources of Network Slimming in the PyTorch [63] library.‡ We follow the "train, prune, and finetune" pipeline as depicted in Fig. 1.

‡ https://github.com/Eric-mingjie/rethinking-network-pruning

a) Datasets and Data Augmentation: We conduct image classification experiments on the CIFAR [61], SVHN [64], and ImageNet [62] datasets. For the CIFAR and SVHN datasets, we follow the common practice of data augmentation: zero-padding of 4 pixels on each side of the image and random cropping of a 32 × 32 patch. On the ImageNet dataset, we adopt the standard data augmentation strategy as in the prior work [37], [39], [59], [65]: resize images to have the shortest edge of 256 pixels and then randomly crop a 224 × 224 patch. Besides, we adopt a random horizontal flip on the cropped image for the CIFAR and ImageNet datasets. The input data is normalized by subtracting the channel-wise means and dividing by the channel-wise standard deviations before being fed to the network.
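For reference, this augmentation corresponds roughly to the following torchvision pipeline. The normalization statistics are the commonly used CIFAR/ImageNet values and are our assumption; the paper only states that channel-wise means and standard deviations are used.

import torchvision.transforms as T

# CIFAR: pad 4 pixels per side, random 32x32 crop, random horizontal flip.
# (For SVHN the paper uses the same crop but no horizontal flip.)
cifar_train = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616)),
])

# ImageNet: shortest edge to 256, random 224x224 crop, random horizontal flip.
imagenet_train = T.Compose([
    T.Resize(256),
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])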


b) Backbone Architectures: We evaluate the proposed method on two representative architectures: VGGNet [39] and ResNet [37]. Following the practice of Network Slimming [12], we use the Pre-Act-ResNet architecture [65], in which the BN layers and non-linearities are placed before the convolutional layers. (See Fig. 3.)

c) Hyper-parameters: The threshold p in Eq. (8) is set to 0.01 unless otherwise specified, and Δλ = 10^−5 in all experiments. We use the SGD optimizer with a momentum of 0.9 and a weight decay of 10^−4. The initial learning rate is 0.1 and is divided by a factor of 10 at the specified epochs. We train for 160 epochs on the CIFAR datasets and 40 epochs on the SVHN dataset. The learning rate decays at 50% and 75% of the total training epochs. On the ImageNet dataset, we train for 100 epochs and decay the learning rate every 30 epochs.
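A minimal sketch of this optimization setup (the milestone arithmetic for the CIFAR/SVHN schedule is ours; ImageNet would instead use a step decay every 30 epochs):

import torch

def make_optimizer(model, total_epochs):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # CIFAR/SVHN: divide the learning rate by 10 at 50% and 75% of the epochs.
    milestones = [int(0.5 * total_epochs), int(0.75 * total_epochs)]
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1)
    return optimizer, scheduler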

d) Half-precision Training on ImageNet: We train models on the ImageNet dataset with half precision (FP16), using the Apex library,§ where the parameters of batch normalization are represented in FP32 while the others are in FP16. This allows us to train the ResNet-50 model within 40 hours on 4 RTX 2080Ti GPUs. Despite training with FP16, we do not observe obvious performance degradation in our experiments. For example, as shown in Tab. IV, we achieve a top-1 accuracy of 76.27% with the Pre-ResNet-50 architecture on the ImageNet dataset, which is very close to that in the original paper [65] or reported in [29].

§ https://github.com/NVIDIA/apex
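The paper does not give the exact mixed-precision configuration; with the Apex library referenced above, a setting that casts the model to FP16 while keeping the batch-norm parameters in FP32 would look roughly like the following (the opt_level choice and the stand-in torchvision model are our assumptions):

import torch
import torchvision
from apex import amp  # https://github.com/NVIDIA/apex

model = torchvision.models.resnet50().cuda()   # stand-in; the paper uses a Pre-Act variant
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# "O2" casts the model to FP16 and keeps batch-norm parameters in FP32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2",
                                  keep_batchnorm_fp32=True)

# In the training loop, the loss is scaled before backward:
# with amp.scale_loss(loss, optimizer) as scaled_loss:
#     scaled_loss.backward()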
e) Train, Prune, and Finetune: We adopt the three-stage pipeline, i.e., train, prune, and finetune, as in many previous pruning methods [12], [17], [18], [27], [30], [66]. (See Fig. 1.) In the experiments, we found that in the first stage, the model sparsity P grows rapidly when the learning rate is large. After the learning rate decays, the model sparsity hardly increases unless an extremely large λ is reached. Therefore, to effectively promote model sparsity, we keep the learning rate fixed in the first stage and decay the learning rate normally in the third stage. On the CIFAR datasets, we train for 160 epochs in the first stage, and on the ImageNet dataset, we train for only 40 epochs in the first stage. On both the CIFAR and ImageNet datasets, we finetune for a full episode.

Fig. 3. Illustration of pruning the bottleneck structure. Planes and grids represent feature maps and convolutional kernels, respectively. The dotted planes and blank grids denote the pruned feature channels and the corresponding convolutional filters. We perform "feature selection" after the first batch-norm layer, and prune only the input dimension of the last convolutional layer. Consequently, the number of channels is unchanged in the residual path.

f) Pruning with Short Connections: In the Pre-Act-ResNet architecture, operators are arranged in the "BN, ReLU, and Conv" order. As depicted in Fig. 3, given the input feature maps, we perform a "feature selection" right after the first batch normalization layer (BN1) to filter out less important channels according to the dependency-aware channel importance (Eq. (7)). For the first and second convolutional layers (Conv1 and Conv2), we prune both the input and output dimensions of their kernels. (The pruned channels are represented as the dotted planes in Fig. 3.) For the last convolutional layer (Conv3), we prune only the input dimension so as to preserve the structure of the residual path. After pruning, the number of channels in the residual path remains unchanged. Note that when computing the model sparsity (Eq. (9)), the "feature selection" is not taken into account because it does not actually prune any filters. For example, in the case of Fig. 3, there are only 2 filters pruned, i.e., the second filter of Conv2 and the first filter of Conv3.
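As an illustration of this bookkeeping, the sketch below rebuilds the three convolutions of a bottleneck from index lists of surviving channels derived from Eq. (7)/(8). It is schematic and ours alone; real code must also shrink the corresponding BN layers and leave the residual addition untouched.

import torch
import torch.nn as nn

def prune_conv(conv: nn.Conv2d, in_keep, out_keep) -> nn.Conv2d:
    """Return a smaller Conv2d that keeps only the listed input/output channels."""
    new_conv = nn.Conv2d(len(in_keep), len(out_keep), conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[out_keep][:, in_keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[out_keep].clone()
    return new_conv

# keep_bn1, keep1, keep2: indices of surviving channels (hypothetical names).
# Conv1 and Conv2 are pruned in both dimensions; Conv3 only in its input
# dimension, so the residual path keeps its original width.
#   conv1 = prune_conv(conv1, in_keep=keep_bn1, out_keep=keep1)   # "feature selection" after BN1
#   conv2 = prune_conv(conv2, in_keep=keep1,    out_keep=keep2)
#   conv3 = prune_conv(conv3, in_keep=keep2,    out_keep=list(range(conv3.out_channels)))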
TABLE I: Experimental results on the CIFAR10 dataset. Our method performs favorably against the Network Slimming (SLM) [12] baseline and other pruning methods on both VGGNets and ResNets.

Model | Method | ratio r | Baseline accuracy | Finetune accuracy
VGG11 | SLM [12] | 0.5 | 92.13 (±0.18) | 91.91 (±0.01)
VGG11 | Ours | 0.5 | 92.02 (±0.20) | 92.17 (±0.18)
VGG16 | SLM | 0.6 | 93.73 (±0.06) | 93.65 (±0.04)
VGG16 | Ours | 0.6 | 93.57 (±0.26) | 93.70 (±0.05)
VGG19 | SLM (from [67]) | 0.7 | 93.53 (±0.16) | 93.60 (±0.16)
VGG19 | Ours | 0.7 | 93.66 (±0.26) | 93.53 (±0.24)
Res56 | SFP | 0.4 | 93.59 (±0.58) | 92.26 (±0.31)
Res56 | ASFP [18] | 0.4 | 93.59 (±0.58) | 92.44 (±0.31)
Res56 | FPGM [27] | 0.4 | 93.59 (±0.58) | 92.93 (±0.49)
Res56 | SLM | 0.4 | 93.56 (±0.19) | 93.33 (±0.14)
Res56 | Ours | 0.4 | 93.73 (±0.10) | 93.86 (±0.19)
Res56 | SLM | 0.5 | 93.56 (±0.19) | 92.90 (±0.14)
Res56 | Ours | 0.5 | 93.73 (±0.10) | 93.62 (±0.16)
Res56 | SLM | 0.6 | 93.56 (±0.19) | 91.94 (±0.10)
Res56 | Ours | 0.6 | 93.73 (±0.10) | 92.68 (±0.15)
Res110 | SFP | 0.4 | 93.68 (±0.32) | 93.38 (±0.30)
Res110 | ASFP [18] | 0.4 | 93.68 (±0.58) | 93.20 (±0.10)
Res110 | FPGM [27] | 0.4 | 93.68 (±0.32) | 93.73 (±0.23)
Res110 | SLM | 0.4 | 94.61 (±0.01) | 94.49 (±0.12)
Res110 | Ours | 0.4 | 94.43 (±0.13) | 94.75 (±0.12)
Res110 | SLM | 0.5 | 94.61 (±0.01) | 94.24 (±0.13)
Res110 | Ours | 0.5 | 94.43 (±0.13) | 94.52 (±0.26)
Res110 | SLM | 0.6 | 94.61 (±0.01) | 93.47 (±0.15)
Res110 | Ours | 0.6 | 94.43 (±0.13) | 94.57 (±0.04)
Res164 | SLM (from [67]) | 0.4 | 95.04 (±0.16) | 94.77 (±0.12)
Res164 | Ours | 0.4 | 94.86 (±0.10) | 95.01 (±0.15)
Res164 | SLM | 0.5 | 95.04 (±0.16) | 94.52 (±0.09)
Res164 | Ours | 0.5 | 94.86 (±0.10) | 94.83 (±0.05)
Res164 | SLM (from [67]) | 0.6 | 95.04 (±0.16) | 94.23 (±0.21)
Res164 | Ours | 0.6 | 94.86 (±0.10) | 94.53 (±0.35)

B. Results on CIFAR

We first evaluate our method on the CIFAR10 and CIFAR100 datasets. Experiments on the CIFAR datasets are conducted using the VGGNets and ResNets with various depths. On the CIFAR datasets, we record the mean and
standard deviation over a 10-fold validation. It is worth noting that, as described in Sec. III-B, Network Slimming [12] often results in unstable architectures, whose performance is greatly degraded. (See Sec. V-D for details.) Therefore, for Network Slimming, we skip the outliers and restart the pipeline if the accuracy is 10% lower than the mean accuracy. Quantitative results on the CIFAR10 and CIFAR100 datasets are summarized in Tab. I and Tab. II, respectively. Additionally, a curve of the classification accuracy vs. the pruning ratio r is shown in Fig. 4.

TABLE II: Experimental results on the CIFAR100 dataset. Here, "N/A" indicates that the compressed model collapses in all runs. Still, our approach consistently outperforms the Network Slimming (SLM) [12] baseline. Notably, our approach outperforms Network Slimming by up to 2% on the ResNet-164 backbone.

Model | Method | ratio r | Baseline accuracy | Finetune accuracy
VGG11 | SLM [12] | 0.3 | 69.33 (±0.26) | 66.54 (±0.14)
VGG11 | Ours | 0.3 | 68.24 (±0.11) | 67.84 (±0.11)
VGG16 | SLM | 0.3 | 73.50 (±0.18) | 73.36 (±0.28)
VGG16 | Ours | 0.3 | 72.16 (±0.23) | 73.59 (±0.37)
VGG16 | SLM | 0.4 | 73.50 (±0.18) | N/A
VGG16 | Ours | 0.4 | 72.16 (±0.23) | 73.59 (±0.23)
VGG19 | SLM (from [67]) | 0.5 | 72.63 (±0.21) | 72.32 (±0.28)
VGG19 | Ours | 0.5 | 71.19 (±0.54) | 72.48 (±0.28)
Res164 | SLM (from [67]) | 0.4 | 76.80 (±0.19) | 76.22 (±0.20)
Res164 | Ours | 0.4 | 76.43 (±0.26) | 77.74 (±0.17)
Res164 | SLM | 0.6 | 76.80 (±0.19) | 74.17 (±0.33)
Res164 | Ours | 0.6 | 76.43 (±0.26) | 76.28 (±0.27)

Fig. 4. Performance (mean and standard deviation over a 10-fold validation) of pruning the ResNet-56 network on the CIFAR10 dataset under various pruning ratios r.

a) VGGNets: We start with the simpler architecture, VGGNet, which is a sequential architecture without skip connections. We find that pruning a large number of filters brings only a minor performance drop. Take the VGGNet-19 as an example. On the CIFAR10 dataset, with 70% of the filters pruned, both Network Slimming and our method even bring a little performance gain. Interestingly, increasing the model depth does not always enhance performance. On both the CIFAR10 and CIFAR100 datasets, VGGNet-16 achieves better (or comparable) performance than VGGNet-19. These observations demonstrate that the VGGNet is heavily over-parameterized for the CIFAR datasets, and that pruning a proportion of filters brings negligible influence to the performance.

b) ResNets: Pruning the ResNet architectures is more complicated because of the residual paths. As described in Sec. IV-A and Fig. 3, we preserve the number of channels in the residual path and only prune filters inside the bottleneck architecture. By pruning the same proportion of filters, our method consistently achieves better results compared with the Network Slimming [12] baseline.

C. Results on SVHN

We then apply the proposed pruning algorithm to the ResNet family on the SVHN dataset, following the same evaluation protocol as in Sec. IV-B. It can be seen from Tab. III that our approach outperforms the state-of-the-art baseline method [12] under various model depths and pruning ratios. Also, Network Slimming [12] often collapses when the pruning ratio is high, e.g., 80%, while our approach is more tolerant of high pruning ratios and still maintains a competitive accuracy. For example, only an accuracy of 0.10% is sacrificed when 80% of the filters are pruned from the ResNet-56 backbone. Furthermore, similar to the circumstances on the CIFAR datasets, pruning a proportion of filters may even bring a performance gain (e.g., when 20% or 40% of the filters are pruned), indicating that a moderate pruning ratio can alleviate the over-fitting problem on relatively small datasets, such as CIFAR and SVHN.

TABLE III: Experimental results on the SVHN dataset. Similarly, "N/A" indicates that the compressed model collapses in all runs. It can be seen that our approach is tolerant of high pruning ratios and outperforms the Network Slimming (SLM) [12] baseline under various experimental settings.

Model | Method | ratio r | Baseline accuracy | Finetune accuracy
Res20 | SLM | 0.2 | 95.85 (±0.07) | 95.82 (±0.18)
Res20 | Ours | 0.2 | 95.85 (±0.07) | 96.18 (±0.09)
Res20 | SLM | 0.4 | 95.85 (±0.07) | 95.77 (±0.13)
Res20 | Ours | 0.4 | 95.85 (±0.07) | 96.20 (±0.11)
Res20 | SLM | 0.6 | 95.85 (±0.07) | 95.66 (±0.07)
Res20 | Ours | 0.6 | 95.85 (±0.07) | 96.15 (±0.05)
Res20 | SLM | 0.8 | 95.85 (±0.07) | N/A
Res20 | Ours | 0.8 | 95.85 (±0.07) | 95.49 (±0.13)
Res56 | SLM | 0.2 | 96.87 (±0.04) | 96.62 (±0.05)
Res56 | Ours | 0.2 | 96.87 (±0.04) | 97.04 (±0.08)
Res56 | SLM | 0.4 | 96.87 (±0.04) | 96.56 (±0.07)
Res56 | Ours | 0.4 | 96.87 (±0.04) | 97.00 (±0.02)
Res56 | SLM | 0.6 | 96.87 (±0.04) | N/A
Res56 | Ours | 0.6 | 96.87 (±0.04) | 97.03 (±0.02)
Res56 | SLM | 0.8 | 96.87 (±0.04) | N/A
Res56 | Ours | 0.8 | 96.87 (±0.04) | 96.77 (±0.05)
TABLE IV: Image classification results on the ImageNet dataset. Our method consistently outperforms the data-independent pruning methods [12], [14], [19], [66], and achieves competitive performance against the data-dependent method [29].

Model | Method | ratio r | Acc. (%) | #Params (×10^7) | #FLOPs (×10^9)
VGG11 | Baseline | - | 70.84 | 3.18 | 7.61
VGG11 | SLM [12] | 0.50 | 68.62 | 1.18 | 6.93
VGG11 | Ours | 0.50 | 69.12 | 1.18 | 6.97
Res50 | Baseline | - | 76.27 | 2.56 | 4.13
Res50 | ThiNet [66] | 0.50 | 71.01 | 1.24 | 3.48
Res50 | ThiNet | 0.70 | 68.42 | 0.87 | 2.20
Res50 | Li et al. [14] | N/A | 72.04 | 1.93 | 2.76
Res50 | SSR-L2,1 [68] | N/A | 72.13 | 1.59 | 1.9
Res50 | SSR-L2,0 [68] | N/A | 72.29 | 1.55 | 1.9
Res50 | SLM | 0.50 | 71.99 | 1.11 | 1.87
Res50 | Ours | 0.50 | 72.41 | 1.07 | 1.86
Res50 | Taylor [29] | 0.19 | 75.48 | 1.79 | 2.66
Res50 | SLM | 0.20 | 75.12 | 1.78 | 2.81
Res50 | Ours | 0.20 | 75.37 | 1.76 | 2.82
Res101 | Baseline | - | 77.37 | 4.45 | 7.86
Res101 | Ye et al. [19]-v1 | N/A | 74.56 | 1.73 | 3.69
Res101 | Ye et al. [19]-v2 | N/A | 75.27 | 2.36 | 4.47
Res101 | Taylor [29] | 0.45 | 75.95 | 2.07 | 2.85
Res101 | SLM | 0.50 | 75.97 | 2.09 | 3.16
Res101 | Ours | 0.50 | 76.54 | 2.17 | 3.23
Res101 | Taylor [29] | 0.25 | 77.35 | 3.12 | 4.70
Res101 | Ours | 0.20 | 77.36 | 3.18 | 4.81

D. Results on ImageNet

Here, we evaluate the proposed method on the large-scale and challenging ImageNet [62] benchmark. The results of Network Slimming [12] and our method are obtained from our implementation, while the other results come from the original papers. We compare against several recently proposed pruning methods with various criteria, including the weight norm [14], the norm of batch-norm factors [12], [19], and a data-dependent pruning method [29]. As summarized in Tab. IV, under the same pruning ratios, our method consistently outperforms the Network Slimming baseline, and retains a comparable number of parameters and complexity (FLOPs). Even compared with the data-dependent pruning method [29], our method still achieves competitive performance.

V. ABLATION STUDY

In this section, we conduct several ablation studies to justify our design choices. All the experiments in this section are conducted on the CIFAR100 dataset.

A. The Effectiveness of Dependency-aware Importance Estimation

In the first ablation study, we verify that our method can more accurately identify less important filters, thus leading to a better compressed architecture. This can be evidenced by 1) the smaller performance drop after pruning, and 2) the better final performance after finetuning.

With the same pruning ratio, e.g., r = 0.5, we assume that the importance estimation is more accurate if the pruned model (without finetuning) achieves higher performance on the validation set. Thus, the accuracy of importance estimation can be measured by the performance of pruned networks under the same pruning ratio. In this experiment, we compare the following three strategies: (a) Network Slimming [12], which measures filter importance by the batch-norm scaling factors only; (b) the dependency-aware importance estimation in Eq. (7); and (c) the dependency-aware importance estimation + automatic regularization control.

Firstly, we conduct an illustrative experiment on the VGGNet-16 backbone with a pruning ratio of 0.3. As shown in Fig. 5, strategy (c) obtains a compressed model with the desired sparsity and achieves the best accuracy after finetuning. Then, we quantitatively compare these three strategies on the VGGNet-16 and ResNet-56 backbones. The statistics over a 10-fold validation are reported in Tab. V.

TABLE V: The performance of the different strategies before and after finetuning.

Model | Method | ratio r | Before Finetune | After Finetune
VGG16 | SLM | 0.3 | 52.19 (±6.82) | 73.36 (±0.28)
VGG16 | SLM+DA | 0.3 | 61.19 (±6.18) | 73.57 (±0.31)
VGG16 | SLM+DA+Auto | 0.3 | 72.83 (±0.26) | 73.59 (±0.37)
Res56 | SLM | 0.5 | 1.41 (±0.25) | 71.13 (±0.26)
Res56 | SLM+DA | 0.5 | 5.29 (±1.01) | 73.62 (±0.14)
Res56 | SLM+DA+Auto | 0.5 | 55.29 (±1.92) | 74.53 (±0.10)

The results in Tab. V reveal that 1) the dependency-aware importance estimation is able to measure the filter importance more accurately, as it achieves a much higher performance before finetuning compared with Network Slimming, and 2) the automatic regularization control helps to derive a model with the desired sparsity and to search for a better architecture, as evidenced by the favorable performance after finetuning.

B. Fixed vs. Adjustable Regularization Coefficient

There are two alternative approaches that can help achieve the desired model sparsity: (a) fix the threshold p and adjust the regularization coefficient λ during training; and (b) fix λ and search for a suitable p after training.

We compare these two alternatives on the ResNet-56 backbone with a pruning ratio of 0.5, which means 50% of the filters will be pruned. For strategy (b), the regularization coefficient λ is fixed to 10^−5, as suggested by [12].

TABLE VI: Comparison of the two alternatives for reaching the desired model sparsity.

Method | Before Pruning | threshold p | Before Finetune | After Finetune
(a) | 60.86 | 0.01 | 60.86 | 75.24
(b) | 73.59 | 0.41 | 1.53 | 74.36

As shown in Tab. VI, under the same pruning ratio, strategy (a) performs favorably against strategy (b) in terms of the
performance before and after finetuning. This justifies our design of dynamically adjusting λ during training.

Fig. 5. Training dynamics of pruning the VGGNet-16 backbone (r = 0.3) on the CIFAR100 dataset with the three different strategies. The horizontal axis represents the training epochs in all three plots. Plots (a), (b), and (c) show the regularization coefficient λ, the model sparsity P, and the finetune accuracy, respectively. Compared with the Network Slimming baseline, the dependency-aware importance estimation helps to identify less important filters, leading to higher performance before/after finetuning. Then, equipped with the automatic regularization control, the model meets the desired sparsity at the end of the first stage, and achieves the best performance after finetuning.

C. Pruning as Architecture Search

As pointed out in Sec. III-B, Network Slimming [12] may lead to unreasonable compressed architectures, as too many filters can be pruned in a single layer. In this experiment, we verify that our method can derive better compressed architectures. To test the difference of the pruned architectures, we re-initialize the parameters of the pruned models, and then train the pruned models for a full episode as in the standard pipeline. Note that we are essentially training the compressed architecture from scratch under the "scratch-E" setting in [67]. The results in Tab. VII indicate that our method derives better compressed architectures, as evidenced by the superior performance when training from scratch.

TABLE VII: The performance of training the compressed architecture from scratch. By training the pruned model with randomly re-initialized weights, our method can still outperform the Network Slimming (SLM) [12] baseline, implying that our approach derives a better network architecture.

Model | Method | Baseline accuracy | Finetune accuracy | Scratch accuracy
Res164 | SLM | 76.80 (±0.19) | 74.17 (±0.33) | 75.05 (±0.08)
Res164 | Ours | 76.43 (±0.26) | 76.43 (±0.27) | 76.41 (±0.32)

D. Pruning Stability

As stated in Sec. III-B, Network Slimming [12] selects the filters to be pruned by ranking the channel importance of different layers across the entire network, leading to unstable architectures. We empirically verify the claim that with a large pruning ratio, our method can still achieve promising results, while Network Slimming leads to collapsed models with a high probability.

Here, we design two experiments. In the first experiment, we give an intuitive illustration of the compressed network architectures induced by Network Slimming and our method. We use the VGGNet-16 backbone with a pruning ratio of 0.4. The filter distributions of the compressed architectures are shown in Fig. 6.

Fig. 6. Filter distributions of the pruned VGGNet-16 backbone. Network Slimming [12] presents an unbalanced architecture where conv5-1 has only two filters remaining and conv5-2 has only one filter remaining.

In the second experiment, we conduct a 5-fold validation on the CIFAR10 and CIFAR100 datasets, again using the VGGNet-16 backbone. The results in Tab. VIII indicate that under a relatively high pruning ratio, our method can still achieve high performance while Network Slimming collapses in all runs.

TABLE VIII: Record of a 5-fold validation on the CIFAR datasets with the VGGNet-16 backbone. In the table, (·/·) indicates the finetune accuracy and the minimal number of remaining channels in each layer after pruning.

Dataset | Method | ratio r | run-1 | run-2 | run-3 | run-4 | run-5
CIFAR10 | SLM | 0.7 | 10.00 / 0 | 10.00 / 0 | 10.00 / 0 | 10.00 / 0 | 10.00 / 0
CIFAR10 | Ours | 0.7 | 93.93 / 24 | 93.66 / 25 | 93.94 / 27 | 93.70 / 23 | 93.89 / 27
CIFAR100 | SLM | 0.4 | 1.00 / 0 | 1.00 / 1 | 1.00 / 0 | 1.00 / 0 | 1.00 / 0
CIFAR100 | Ours | 0.4 | 73.24 / 29 | 73.60 / 37 | 73.92 / 35 | 73.47 / 37 | 73.71 / 37

VI. CONCLUSION

In this paper, we propose a principled criterion to identify the unimportant filters with consideration of the inter-layer dependency. Based on this, we prune filters according to
the local channel importance, and introduce an automatic-regularization-control mechanism to dynamically adjust the coefficient of the sparsity regularization. In the end, our method is able to compress state-of-the-art neural networks with a minimal accuracy drop. Comprehensive experimental results on the CIFAR, SVHN, and ImageNet datasets demonstrate that our approach performs favorably against the Network Slimming [12] baseline and achieves competitive performance among the concurrent data-dependent and data-independent pruning approaches, indicating the essential role of the inter-layer dependency in principled filter pruning algorithms.

ACKNOWLEDGMENTS

This research was supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, NSFC (61922046), the national youth talent support program, and the Tianjin Natural Science Foundation (18ZXZNGX00110).

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] G. Deshpande, P. Wang, D. Rangaprakash, and B. Wilamowski, "Fully connected cascade artificial neural network architecture for attention deficit hyperactivity disorder classification from functional magnetic resonance imaging data," IEEE transactions on cybernetics, vol. 45, no. 12, pp. 2668–2679, 2015.
[3] R. Girshick, "Fast r-cnn," in International Conference on Computer Vision and Pattern Recognition, 2015, pp. 1440–1448.
[4] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in International Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[5] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Advances in neural information processing systems, 2014, pp. 1988–1996.
[6] Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. T. Shen, "Multitask spectral clustering by exploring intertask correlation," IEEE transactions on cybernetics, vol. 45, no. 5, pp. 1083–1094, 2014.
[7] X. Chang, Z. Ma, Y. Yang, Z. Zeng, and A. G. Hauptmann, "Bi-level semantic representation analysis for multimedia event detection," IEEE transactions on cybernetics, vol. 47, no. 5, pp. 1180–1197, 2016.
[8] M. Luo, X. Chang, L. Nie, Y. Yang, A. G. Hauptmann, and Q. Zheng, "An adaptive semisupervised feature analysis for video semantic recognition," IEEE transactions on cybernetics, vol. 48, no. 2, pp. 648–660, 2017.
[9] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, and D. Tao, "Stacked convolutional denoising auto-encoders for feature representation," IEEE transactions on cybernetics, vol. 47, no. 4, pp. 1017–1027, 2016.
[10] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, "Cross-modal retrieval with cnn visual features: A new baseline," IEEE transactions on cybernetics, vol. 47, no. 2, pp. 449–460, 2016.
[11] L. Zhang and P. N. Suganthan, "Visual tracking with convolutional random vector functional link network," IEEE transactions on cybernetics, vol. 47, no. 10, pp. 3243–3253, 2016.
[12] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in International Conference on Computer Vision, 2017, pp. 2736–2744.
[13] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "Morphnet: Fast & simple resource-constrained structure learning of deep networks," in International Conference on Computer Vision and Pattern Recognition, 2018, pp. 1586–1595.
[14] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," in International Conference on Learning Representations, 2017.
[15] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Neural Information Processing Systems, 2016, pp. 2074–2082.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015.
[17] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," in IJCAI. International Joint Conferences on Artificial Intelligence, 2018, pp. 2234–2240.
[18] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan, and Y. Yang, "Asymptotic soft filter pruning for deep convolutional neural networks," IEEE transactions on cybernetics, 2019.
[19] J. Ye, X. Lu, Z. Lin, and J. Z. Wang, "Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers," in International Conference on Learning Representations, 2018.
[20] M. A. Carreira-Perpinán and Y. Idelbayev, ""Learning-compression" algorithms for neural net pruning," in International Conference on Computer Vision and Pattern Recognition, 2018, pp. 8532–8541.
[21] X. Dong, S. Chen, and S. Pan, "Learning to prune deep neural networks via layer-wise optimal brain surgeon," in Neural Information Processing Systems, 2017, pp. 4857–4867.
[22] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient dnns," in Neural Information Processing Systems, 2016, pp. 1379–1387.
[23] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Neural Information Processing Systems, 2015, pp. 1135–1143.
[24] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Neural Information Processing Systems, 1993, pp. 164–171.
[25] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Neural Information Processing Systems, 1990, pp. 598–605.
[26] S. Srinivas, A. Subramanya, and R. Venkatesh Babu, "Training sparse neural networks," in CVPRW, 2017, pp. 138–145.
[27] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, "Filter pruning via geometric median for deep convolutional neural networks acceleration," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 4340–4349.
[28] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in International Conference on Computer Vision, 2017, pp. 1398–1406.
[29] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, "Importance estimation for neural network pruning," in International Conference on Computer Vision and Pattern Recognition. IEEE, 2019, pp. 11264–11272.
[30] L. Zeng and X. Tian, "Accelerating convolutional neural networks by removing interspatial and interkernel redundancies," IEEE transactions on cybernetics, vol. 50, no. 2, pp. 452–464, 2018.
[31] A. Polyak and L. Wolf, "Channel-level acceleration of deep face representations," IEEE Access, vol. 3, pp. 2163–2175, 2015.
[32] Z. Zheng, Z. Li, A. Nagar, and K. Park, "Compact deep neural networks for device based image classification," in 2015 IEEE International Conference on Multimedia & Expo Workshops, ICME Workshops 2015, Turin, Italy, June 29 - July 3, 2015, 2015, pp. 1–6.
[33] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, p. 32, 2017.
[34] Y. Zhou, Y. Zhang, Y. Wang, and Q. Tian, "Accelerate cnn via recursive bayesian pruning," in International Conference on Computer Vision, 2018.
[35] Y. Zhou, G. G. Yen, and Z. Yi, "A knee-guided evolutionary algorithm for compressing deep neural networks," IEEE transactions on cybernetics, 2019.
[36] P. T. Fletcher, S. Venkatasubramanian, and S. Joshi, "Robust statistics on riemannian manifolds via the geometric median," in International Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in International Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in International Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[39] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[40] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware," in International Conference on Learning Representations, 2019.
[41] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia et al., "Chamnet: Towards efficient network design through platform-aware model adaptation," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 11398–11407.
[42] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in European Conference on Computer Vision, 2018, pp. 19–34.
[43] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," in International Conference on Machine Learning, 2018.
[44] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "Mnasnet: Platform-aware neural architecture search for mobile," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 2820–2828.
[45] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, "FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 10734–10742.
[46] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in International Conference on Learning Representations, 2017.
[47] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Neural Information Processing Systems, 2014, pp. 1269–1277.
[48] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in Neural Information Processing Systems, 2015, pp. 3088–3096.
[49] X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2015.
[50] H. Huang and H. Yu, "Ltnn: A layerwise tensorized compression of multilayer neural network," IEEE transactions on neural networks and learning systems, vol. 30, no. 5, pp. 1497–1511, 2018.
[51] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in International Conference on Machine Learning, 2015, pp. 2285–2294.
[52] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[53] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[54] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in International Conference on Computer Vision and Pattern Recognition, 2016, pp. 4820–4828.
[55] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, "Quantized cnn: A unified approach to accelerate and compress convolutional networks," IEEE transactions on neural networks and learning systems, vol. 29, no. 10, pp. 4730–4743, 2017.
[56] V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann machines," in International Conference on Machine Learning, 2010, pp. 807–814.
[57] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in International Conference on Machine Learning, 2013.
[58] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (elus)," in International Conference on Learning Representations, 2016.
[59] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in International Conference on Computer Vision, 2015, pp. 1026–1034.
[60] Y. Li, Z. Kuang, Y. Chen, and W. Zhang, "Data-driven neuron allocation for scale aggregation networks," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 11526–11534.
[61] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[62] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[63] B. Steiner, Z. DeVito, S. Chintala, S. Gross, A. Paszke, F. Massa, A. Lerer, G. Chanan, Z. Lin, E. Yang et al., "Pytorch: An imperative style, high-performance deep learning library," in Neural Information Processing Systems, 2019.
[64] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[65] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[66] J.-H. Luo, J. Wu, and W. Lin, "Thinet: A filter level pruning method for deep neural network compression," in International Conference on Computer Vision, 2017, pp. 5058–5066.
[67] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," in International Conference on Learning Representations, 2019.
[68] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li, "Toward compact convnets via structure-sparsity regularized filter pruning," IEEE transactions on neural networks and learning systems, 2019.
