the sparsity-inducing regularization to achieve the desired model sparsity. Comprehensive experimental results demonstrate that the improved filter pruning strategy performs favorably against the existing strong baseline [12] on the CIFAR, SVHN, and ImageNet datasets. We also validate our design choices with several ablation studies and verify that the proposed algorithm reaches more stable and well-performing architectures.

parameters (either the convolutional weights [14], [27] or the BN scaling factors [12]) of a single layer. Besides, we propose a novel mechanism to dynamically control the coefficient of the sparsity-inducing regularization, instead of pre-defining it based on human heuristics [12]. Incorporating these components, our principled approach can better estimate the filter importance (Sec. V-A) and achieve more balanced pruned architectures (Sec. V-D).
Fig. 2. Illustration of the "batch normalization, convolution" sequential. The input feature X^l (after whitening) is stretched by the scaling factors γ^l and convolved with three convolutional kernels, resulting in a three-channel output feature Y^l. For the dependency-aware criterion, the importance of the cth feature map is estimated by the product of the absolute value of the scaling factor |γ_c^l| and the magnitude of the corresponding convolutional kernel ||W̃_c^{l+1}||. (See Eq. (7).)
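To make the criterion concrete, the following PyTorch sketch scores one "batch normalization, convolution" pair as in Fig. 2. It assumes the convolution directly consumes the scaled feature maps and uses the L2 norm of the kernel slice that reads from each channel as its magnitude; the function name and the norm choice are illustrative rather than the paper's reference implementation of Eq. (7).

    import torch
    from torch import nn

    def dependency_aware_importance(bn: nn.BatchNorm2d, next_conv: nn.Conv2d) -> torch.Tensor:
        """Score each channel by |gamma_c^l| * ||W_c^{l+1}|| (cf. Eq. (7) and Fig. 2)."""
        gamma = bn.weight.detach().abs()                  # |gamma_c^l|, shape: [C]
        w = next_conv.weight.detach()                     # shape: [C_out, C, k, k]
        # Norm of all kernel weights that read from input channel c.
        kernel_norm = w.permute(1, 0, 2, 3).flatten(1).norm(p=2, dim=1)   # shape: [C]
        return gamma * kernel_norm                        # S_c^l, shape: [C]

    # Example: a 16-channel BN followed by a 3x3 convolution with 32 output channels.
    bn, conv = nn.BatchNorm2d(16), nn.Conv2d(16, 32, kernel_size=3, padding=1)
    scores = dependency_aware_importance(bn, conv)        # tensor of 16 importance scores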
However, in practice we find that this strategy sometimes prunes too many filters of a layer (or occasionally all filters of a layer), leading to severely degraded performance. This is because it does not take the intrinsic statistical variation among different layers into consideration. Suppose there are two layers whose scaling factors are {0.10, 0.01, 0.03, 0.15} and {1, 100, 2, 200}, respectively, and the target is to prune half of the filters, i.e., r = 0.5. Clearly, the second and third channels should be pruned from the first layer, and the first and third channels should be pruned from the second layer. However, if we rank the scaling factors globally, all filters of the first layer will be pruned, which is obviously unreasonable.
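The failure mode is easy to reproduce numerically: the short sketch below (variable names are ours) applies a single global threshold to the toy scaling factors above and ends up removing every channel of the first layer.

    # Toy example from the text: two layers with very different scaling-factor ranges.
    layer1 = [0.10, 0.01, 0.03, 0.15]
    layer2 = [1.0, 100.0, 2.0, 200.0]

    # Global ranking: prune the smallest half of all scaling factors across both layers.
    all_factors = [(g, "layer1", c) for c, g in enumerate(layer1)] + \
                  [(g, "layer2", c) for c, g in enumerate(layer2)]
    all_factors.sort()                              # ascending by scaling factor
    pruned = all_factors[: len(all_factors) // 2]   # r = 0.5

    print([(layer, c) for _, layer, c in pruned])
    # [('layer1', 1), ('layer1', 2), ('layer1', 0), ('layer1', 3)]
    # -> every channel of the first layer is removed, collapsing that layer.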
To alleviate this issue, we instead select the unimportant filters based on intra-layer statistics. Let S_c^l be the importance of the cth channel in the lth layer. Filters whose importance satisfies S_c^l ≤ max(S^l) · p are pruned, where the threshold p ∈ (0, 1) is a hyper-parameter. Formally, the set of filters to be pruned in the lth layer is

    F_pruned^l = {c : S_c^l ≤ max(S^l) · p}.    (8)

In our solution, the choice of the filters to be pruned in one layer is made independently of the statistics of other layers, so that the intrinsic statistical differences among layers do not result in a dramatically unbalanced neural architecture.
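As a sketch of Eq. (8) (the function name is ours, not the official code), the rule is applied to each layer's scores independently. With an illustrative threshold it keeps the strong channels of both toy layers instead of wiping out the first one; the paper's default p = 0.01 is much smaller.

    import torch

    def filters_to_prune(scores: torch.Tensor, p: float = 0.01) -> torch.Tensor:
        """Eq. (8): prune channels whose importance is at most p * max(S^l)."""
        threshold = scores.max() * p
        return torch.nonzero(scores <= threshold, as_tuple=False).flatten()

    # With an illustrative p = 0.35, each toy layer loses its two weakest channels,
    # and no layer is ever emptied by the statistics of another layer.
    print(filters_to_prune(torch.tensor([0.10, 0.01, 0.03, 0.15]), p=0.35))  # tensor([1, 2])
    print(filters_to_prune(torch.tensor([1.0, 100.0, 2.0, 200.0]), p=0.35))  # tensor([0, 2])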
C. Automatic Control of Sparsity Regularization

Network Slimming [12] imposes an L1 regularization on the model parameters to promote model sparsity. However, choosing a proper regularization coefficient λ is non-trivial and mostly requires manual tuning based on human heuristics. For example, Network Slimming performs a grid search over a set of candidate coefficients for each dataset and network architecture. Moreover, different pruning ratios require different levels of model sparsity, and thus different coefficients λ. It is extremely inefficient to tune λ for each experimental setting.

To avoid manually choosing λ and meet the required model sparsity at the same time, we propose to automatically control the regularization coefficient λ. Following the practice in [12], an L1 regularization is imposed on the scaling factors of the batch normalization layers. As shown in Alg. 1, at the end of the tth epoch, we calculate the overall sparsity of the model:

    P_t = (Σ_l |F_pruned^l|) / (Σ_l C^l).    (9)

Given the total number of epochs N, we compute the expected sparsity gain, and if the sparsity gain within an epoch does not meet the requirement, i.e., P_t − P_{t−1} < (r − P_{t−1})/(N − t + 1), the regularization coefficient λ is increased by ∆λ. If the model is over-sparse, i.e., P_t > r, the coefficient λ is decreased by ∆λ. This strategy guarantees that the model meets the desired model sparsity, and that the pruned filters contribute negligibly to the outputs.

Algorithm 1: Automatic Regularization Control
    Initialize λ_1 = 0, P_0 = 0, N = #epochs
    for t := 1 to N do
        train for 1 epoch
        P_t = (Σ_l |F_pruned^l|) / (Σ_l C^l)
        if P_t − P_{t−1} < (r − P_{t−1})/(N − t + 1) then
            λ_{t+1} = λ_t + ∆λ
        else if P_t > r then
            λ_{t+1} = λ_t − ∆λ
    end
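A minimal training-loop sketch of Alg. 1 is shown below, assuming the channels counted as pruned are those selected by the Eq. (8) rule on the BN scaling factors; the L1 term is added to the task loss with the current coefficient λ, which is updated once per epoch exactly as in the algorithm (all function and variable names are ours).

    import torch
    from torch import nn

    def model_sparsity(model: nn.Module, p: float = 0.01) -> float:
        """Eq. (9): fraction of BN channels that Eq. (8) would currently prune."""
        pruned, total = 0, 0
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                s = m.weight.detach().abs()
                pruned += int((s <= s.max() * p).sum())
                total += s.numel()
        return pruned / max(total, 1)

    def l1_on_bn_scales(model: nn.Module):
        """Sparsity-inducing penalty on the BN scaling factors, as in [12]."""
        return sum(m.weight.abs().sum() for m in model.modules()
                   if isinstance(m, nn.BatchNorm2d))

    def train_with_auto_reg(model, loader, optimizer, criterion,
                            num_epochs: int, r: float = 0.5, delta: float = 1e-5):
        lam, prev_sparsity = 0.0, 0.0
        for t in range(1, num_epochs + 1):
            for x, y in loader:                       # train for one epoch
                optimizer.zero_grad()
                loss = criterion(model(x), y) + lam * l1_on_bn_scales(model)
                loss.backward()
                optimizer.step()
            sparsity = model_sparsity(model)          # P_t
            if sparsity - prev_sparsity < (r - prev_sparsity) / (num_epochs - t + 1):
                lam += delta                          # not sparse enough: strengthen L1
            elif sparsity > r:
                lam -= delta                          # over-sparse: relax L1
            prev_sparsity = sparsity
        return model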
IV. EXPERIMENTAL RESULTS

In this section, we first describe the details of our implementation, and then report the results on the CIFAR [61] datasets in Sec. IV-B and the ImageNet [62] dataset in Sec. IV-D.

A. Implementation Details

Our implementation is based on the official training sources of Network Slimming in the PyTorch [63] library.‡ We follow the "train, prune, and finetune" pipeline as depicted in Fig. 1.

‡ https://github.com/Eric-mingjie/rethinking-network-pruning

a) Datasets and Data Augmentation: We conduct image classification experiments on the CIFAR [61], SVHN [64], and ImageNet [62] datasets. For the CIFAR and SVHN datasets, we follow the common practice of data augmentation: zero-padding of 4 pixels on each side of the image and random cropping of a 32 × 32 patch. On the ImageNet dataset, we adopt the standard data augmentation strategy of prior work [37], [39], [59], [65]: resize images to have a shortest edge of 256 pixels and then randomly crop a 224 × 224 patch. In addition, we apply a random horizontal flip to the cropped image for the CIFAR and ImageNet datasets. The input data are normalized by subtracting the channel-wise means and dividing by the channel-wise standard deviations before being fed to the network.
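For reference, the recipe above corresponds to a standard torchvision transform stack such as the following sketch; the normalization statistics are the commonly used CIFAR and ImageNet values and are our assumption, not numbers taken from the paper.

    from torchvision import transforms

    # CIFAR-style augmentation: pad 4 pixels per side, random 32x32 crop, horizontal flip,
    # then channel-wise normalization (mean/std values are assumed, not from the paper).
    cifar_mean, cifar_std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)
    cifar_train_transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(cifar_mean, cifar_std),
    ])

    # ImageNet-style augmentation: shortest edge 256, random 224x224 crop, horizontal flip.
    imagenet_train_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.RandomCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])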
b) Backbone Architectures: We evaluate the proposed method on two representative architectures: VGGNet [39] and ResNet [37]. Following the practice of Network Slimming [12], we use the Pre-Act-ResNet architecture [65], in which the BN layers and non-linearities are placed before the convolutional layers. (See Fig. 3.)

c) Hyper-parameters: The threshold p in Eq. (8) is set to 0.01 unless otherwise specified, and ∆λ = 10^−5 in all experiments. We use the SGD optimizer with a momentum of 0.9 and a weight decay of 10^−4. The initial learning rate is 0.1 and is divided by a factor of 10 at the specified epochs. We train for 160 epochs on the CIFAR datasets and 40 epochs on the SVHN dataset, with the learning rate decayed at 50% and 75% of the total training epochs. On the ImageNet dataset, we train for 100 epochs and decay the learning rate every 30 epochs.

d) Half-precision Training on ImageNet: We train models on the ImageNet dataset with half-precision (FP16) using the Apex library,§ where the parameters of batch normalization are kept in FP32 while the others are in FP16. This allows us to
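The hyper-parameters in c) translate directly into PyTorch, as in the sketch below; torchvision's vgg16_bn serves only as a stand-in backbone, and the milestone epochs follow our reading of the "50% and 75%" rule rather than values copied from the released code.

    import torch
    from torchvision.models import vgg16_bn

    model = vgg16_bn(num_classes=100)          # stand-in backbone for CIFAR100

    epochs = 160                               # CIFAR setting; 40 for SVHN, 100 for ImageNet
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Decay by 10x at 50% and 75% of training: epochs 80 and 120 for CIFAR.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[epochs // 2, int(epochs * 0.75)], gamma=0.1)

    for epoch in range(epochs):
        # ... one training epoch over the data loader ...
        scheduler.step()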
Fig. 3. Illustration of pruning the bottleneck structure. Planes and grids represent feature maps and convolutional kernels, respectively. The dotted planes and
blank grids denote the pruned feature channels and the corresponding convolutional filters. We perform “feature selection” after the first batch-norm layer,
and prune only the input dimension of the last convolutional layer. Consequently, the number of channels is unchanged in the residual path.
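One way to realize the "feature selection" in Fig. 3 is a small index layer placed after the first batch-norm of a bottleneck, as sketched below; the module name and the concrete widths are our illustration (the official Network Slimming code uses a similar channel-selection layer). Because only the input dimension of the last convolution is pruned, the residual path keeps its original width.

    import torch
    from torch import nn

    class ChannelSelection(nn.Module):
        """Pass through only the channels kept after pruning (the 'feature selection' of Fig. 3)."""
        def __init__(self, kept_indices):
            super().__init__()
            self.register_buffer("kept_indices",
                                 torch.as_tensor(kept_indices, dtype=torch.long))

        def forward(self, x):                      # x: [N, C, H, W]
            return x.index_select(1, self.kept_indices)

    # A pre-activation bottleneck sketch: BN -> select -> 1x1 -> BN -> 3x3 -> BN -> 1x1.
    # Only the inner widths shrink; the block still outputs 256 channels for the residual add.
    kept = [c for c in range(256) if c % 2 == 0]   # pretend half of the 256 inputs survive
    block = nn.Sequential(
        nn.BatchNorm2d(256), nn.ReLU(), ChannelSelection(kept),
        nn.Conv2d(len(kept), 48, kernel_size=1),   # pruned 1x1 (illustrative width)
        nn.BatchNorm2d(48), nn.ReLU(),
        nn.Conv2d(48, 48, kernel_size=3, padding=1),
        nn.BatchNorm2d(48), nn.ReLU(),
        nn.Conv2d(48, 256, kernel_size=1),         # only its input dimension is pruned
    )
    x = torch.randn(2, 256, 8, 8)
    out = x + block(x)                             # residual path width unchanged: 256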
TABLE IV
IMAGE CLASSIFICATION RESULTS ON THE IMAGENET DATASET. OUR METHOD CONSISTENTLY OUTPERFORMS THE DATA-INDEPENDENT PRUNING METHODS [12], [14], [19], [66], AND ACHIEVES COMPETITIVE PERFORMANCE AGAINST THE DATA-DEPENDENT METHOD [29].

Model  | Methods            | ratio r | Acc. (%) | #Params (10^7) | #FLOPs (10^9)
VGG11  | Baseline           | -       | 70.84    | 3.18           | 7.61
VGG11  | SLM [12]           | 0.50    | 68.62    | 1.18           | 6.93
VGG11  | Ours               | 0.50    | 69.12    | 1.18           | 6.97
Res50  | Baseline           | -       | 76.27    | 2.56           | 4.13
Res50  | ThiNet [66]        | 0.50    | 71.01    | 1.24           | 3.48
Res50  | ThiNet [66]        | 0.70    | 68.42    | 0.87           | 2.20
Res50  | Li et al. [14]     | N/A     | 72.04    | 1.93           | 2.76
Res50  | SSR-L2,1 [68]      | N/A     | 72.13    | 1.59           | 1.90
Res50  | SSR-L2,0 [68]      | N/A     | 72.29    | 1.55           | 1.90
Res50  | SLM [12]           | 0.50    | 71.99    | 1.11           | 1.87
Res50  | Ours               | 0.50    | 72.41    | 1.07           | 1.86
Res50  | Taylor [29]        | 0.19    | 75.48    | 1.79           | 2.66
Res50  | SLM [12]           | 0.20    | 75.12    | 1.78           | 2.81
Res50  | Ours               | 0.20    | 75.37    | 1.76           | 2.82
Res101 | Baseline           | -       | 77.37    | 4.45           | 7.86
Res101 | Ye et al. [19]-v1  | N/A     | 74.56    | 1.73           | 3.69
Res101 | Ye et al. [19]-v2  | N/A     | 75.27    | 2.36           | 4.47
Res101 | Taylor [29]        | 0.45    | 75.95    | 2.07           | 2.85
Res101 | SLM [12]           | 0.50    | 75.97    | 2.09           | 3.16
Res101 | Ours               | 0.50    | 76.54    | 2.17           | 3.23
Res101 | Taylor [29]        | 0.25    | 77.35    | 3.12           | 4.70
Res101 | Ours               | 0.20    | 77.36    | 3.18           | 4.81

TABLE V
THE PERFORMANCE OF THE DIFFERENT STRATEGIES BEFORE AND AFTER FINETUNING (TOP-1 ACCURACY, %).

Model | Methods      | ratio r | Before Finetune | After Finetune
VGG16 | SLM          | 0.3     | 52.19 (±6.82)   | 73.36 (±0.28)
VGG16 | SLM+DA       | 0.3     | 61.19 (±6.18)   | 73.57 (±0.31)
VGG16 | SLM+DA+Auto  | 0.3     | 72.83 (±0.26)   | 73.59 (±0.37)
Res56 | SLM          | 0.5     | 1.41 (±0.25)    | 71.13 (±0.26)
Res56 | SLM+DA       | 0.5     | 5.29 (±1.01)    | 73.62 (±0.14)
Res56 | SLM+DA+Auto  | 0.5     | 55.29 (±1.92)   | 74.53 (±0.10)
D. Results on ImageNet

Here, we evaluate the proposed method on the large-scale and challenging ImageNet [62] benchmark. The results of Network Slimming [12] and our method are obtained from our implementation, while the other results come from the original papers. We compare against several recently proposed pruning methods with various criteria, including the weight norm [14], the norm of batch-norm factors [12], [19], and a data-dependent pruning method [29]. As summarized in Tab. IV, under the same pruning ratios, our method consistently outperforms the Network Slimming baseline while retaining a comparable number of parameters and a comparable complexity (FLOPs). Even compared with the data-dependent pruning method [29], our method still achieves competitive performance.

V. ABLATION STUDY

In this section, we conduct several ablation studies to justify our design choices. All the experiments in this section are conducted on the CIFAR100 dataset.

A. Accuracy of Importance Estimation

With the same pruning ratio, e.g., r = 0.5, we assume that the importance estimation is more accurate if the pruned model (without finetuning) achieves higher performance on the validation set. Thus, the accuracy of importance estimation can be measured by the performance of pruned networks under the same pruning ratio. In this experiment, we compare the following three strategies: (a) Network Slimming [12], which measures filter importance by the batch-norm scaling factors only; (b) the dependency-aware importance estimation in Eq. (7); and (c) the dependency-aware importance estimation combined with automatic regularization control.

Firstly, we conduct an illustrative experiment on the VGGNet-16 backbone with a pruning ratio of 0.3. As shown in Fig. 5, strategy (c) obtains a compressed model with the desired sparsity and achieves the best accuracy after finetuning. Then, we quantitatively compare the three strategies on the VGGNet-16 and ResNet-56 backbones. The statistics over a 10-fold validation are reported in Tab. V.

The results in Tab. V reveal that 1) the dependency-aware importance estimation measures the filter importance more accurately, as it achieves a much higher performance before finetuning compared with Network Slimming, and 2) the automatic regularization control helps derive a model with the desired sparsity and search for a better architecture, as evidenced by the favorable performance after finetuning.
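One simple way to obtain the "before finetuning" numbers is to zero the BN scale and shift of the channels selected for pruning and evaluate the masked network directly, as in the sketch below. For brevity the score here is just the BN scaling factor (the dependency-aware score of Eq. (7) can be substituted), and masking is our stand-in for physically removing the filters.

    import torch
    from torch import nn

    @torch.no_grad()
    def mask_pruned_channels(model: nn.Module, p: float = 0.01) -> None:
        """Zero the BN scale/shift of channels that the Eq. (8) rule marks for pruning."""
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                scores = m.weight.abs()
                pruned = scores <= scores.max() * p
                m.weight[pruned] = 0.0
                m.bias[pruned] = 0.0

    @torch.no_grad()
    def top1_accuracy(model: nn.Module, loader) -> float:
        model.eval()
        correct = total = 0
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
        return 100.0 * correct / total

    # Usage: mask_pruned_channels(model); acc_before_finetune = top1_accuracy(model, val_loader)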
B. Fixed vs. Adjustable Regularization Coefficient

There are two alternative approaches that can help achieve the desired model sparsity: (a) fix the threshold p and adjust the regularization coefficient λ during training; and (b) fix λ and search for a suitable p after training.

We compare these two alternatives on the ResNet-56 backbone with a pruning ratio of 0.5, which means that 50% of the filters will be pruned. For strategy (b), the regularization coefficient λ is fixed to 10^−5, as suggested by [12].

As shown in Tab. VI, under the same pruning ratio, strategy (a) performs favorably against strategy (b) in terms of the performance before and after finetuning. This justifies our design of dynamically adjusting λ during training.
Fig. 5. Training dynamics of pruning the VGGNet-16 backbone (r = 0.3) on the CIFAR100 dataset with the three different strategies (Slimming, Slimming+Dependency, Slimming+Dependency+Auto). The horizontal axis represents the training epochs in all three plots. Plots (a), (b), and (c) show the regularization coefficient λ, the model sparsity P, and the finetune accuracy, respectively. Compared with the Network Slimming baseline, the dependency-aware importance estimation helps identify less important filters, leading to higher performance before and after finetuning. Equipped with the automatic regularization control, the model additionally meets the desired sparsity at the end of the first stage and achieves the best performance after finetuning.
C. Pruning as Architecture Search

As pointed out in Sec. III-B, Network Slimming [12] may lead to unreasonable compressed architectures, since too many filters can be pruned from a single layer. In this experiment, we verify that our method derives better compressed architectures. To test the difference between the pruned architectures, we re-initialize the parameters of the pruned models and then train them for a full episode as in the standard pipeline. Note that we are essentially training the compressed architectures from scratch under the "scratch-E" setting in [67]. The results in Tab. VII indicate that our method derives better compressed architectures, as evidenced by the superior performance when training from scratch. We use the VGGNet-16 backbone with a pruning ratio of 0.4. The filter distributions of the compressed architectures are shown in Fig. 6.

In the second experiment, we conduct a 5-fold validation on the CIFAR10 and CIFAR100 datasets, again using the VGGNet-16 backbone. The results in Tab. VIII indicate that under a relatively high pruning ratio, our method still achieves high performance while Network Slimming collapses in all runs.

Fig. 6. Number of filters per layer of the baseline VGGNet-16 and the architectures compressed by Network Slimming [12] and our method.

VI. CONCLUSION
In this paper, we propose a principled criterion to identify unimportant filters while taking the inter-layer dependency into consideration. Building on this criterion, we prune filters according to the local channel importance, and we introduce an automatic regularization control mechanism to dynamically adjust the coefficient of the sparsity regularization. As a result, our method is able to compress state-of-the-art neural networks with a minimal accuracy drop. Comprehensive experimental results on the CIFAR, SVHN, and ImageNet datasets demonstrate that our approach performs favorably against the Network Slimming [12] baseline and achieves competitive performance among the concurrent data-dependent and data-independent pruning approaches, indicating the essential role of the inter-layer dependency in principled filter pruning algorithms.

ACKNOWLEDGMENTS

This research was supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, NSFC (61922046), the National Youth Talent Support Program, and the Tianjin Natural Science Foundation (18ZXZNGX00110).
REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] G. Deshpande, P. Wang, D. Rangaprakash, and B. Wilamowski, "Fully connected cascade artificial neural network architecture for attention deficit hyperactivity disorder classification from functional magnetic resonance imaging data," IEEE Transactions on Cybernetics, vol. 45, no. 12, pp. 2668–2679, 2015.
[3] R. Girshick, "Fast R-CNN," in International Conference on Computer Vision and Pattern Recognition, 2015, pp. 1440–1448.
[4] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in International Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[5] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Neural Information Processing Systems, 2014, pp. 1988–1996.
[6] Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. T. Shen, "Multitask spectral clustering by exploring intertask correlation," IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 1083–1094, 2014.
[7] X. Chang, Z. Ma, Y. Yang, Z. Zeng, and A. G. Hauptmann, "Bi-level semantic representation analysis for multimedia event detection," IEEE Transactions on Cybernetics, vol. 47, no. 5, pp. 1180–1197, 2016.
[8] M. Luo, X. Chang, L. Nie, Y. Yang, A. G. Hauptmann, and Q. Zheng, "An adaptive semisupervised feature analysis for video semantic recognition," IEEE Transactions on Cybernetics, vol. 48, no. 2, pp. 648–660, 2017.
[9] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, and D. Tao, "Stacked convolutional denoising auto-encoders for feature representation," IEEE Transactions on Cybernetics, vol. 47, no. 4, pp. 1017–1027, 2016.
[10] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan, "Cross-modal retrieval with CNN visual features: A new baseline," IEEE Transactions on Cybernetics, vol. 47, no. 2, pp. 449–460, 2016.
[11] L. Zhang and P. N. Suganthan, "Visual tracking with convolutional random vector functional link network," IEEE Transactions on Cybernetics, vol. 47, no. 10, pp. 3243–3253, 2016.
[12] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in International Conference on Computer Vision, 2017, pp. 2736–2744.
[13] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & simple resource-constrained structure learning of deep networks," in International Conference on Computer Vision and Pattern Recognition, 2018, pp. 1586–1595.
[14] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," in International Conference on Learning Representations, 2017.
[15] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Neural Information Processing Systems, 2016, pp. 2074–2082.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015.
[17] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," in International Joint Conference on Artificial Intelligence, 2018, pp. 2234–2240.
[18] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan, and Y. Yang, "Asymptotic soft filter pruning for deep convolutional neural networks," IEEE Transactions on Cybernetics, 2019.
[19] J. Ye, X. Lu, Z. Lin, and J. Z. Wang, "Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers," in International Conference on Learning Representations, 2018.
[20] M. A. Carreira-Perpinán and Y. Idelbayev, "'Learning-compression' algorithms for neural net pruning," in International Conference on Computer Vision and Pattern Recognition, 2018, pp. 8532–8541.
[21] X. Dong, S. Chen, and S. Pan, "Learning to prune deep neural networks via layer-wise optimal brain surgeon," in Neural Information Processing Systems, 2017, pp. 4857–4867.
[22] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Neural Information Processing Systems, 2016, pp. 1379–1387.
[23] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Neural Information Processing Systems, 2015, pp. 1135–1143.
[24] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Neural Information Processing Systems, 1993, pp. 164–171.
[25] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Neural Information Processing Systems, 1990, pp. 598–605.
[26] S. Srinivas, A. Subramanya, and R. Venkatesh Babu, "Training sparse neural networks," in CVPR Workshops, 2017, pp. 138–145.
[27] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, "Filter pruning via geometric median for deep convolutional neural networks acceleration," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 4340–4349.
[28] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in International Conference on Computer Vision, 2017, pp. 1398–1406.
[29] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, "Importance estimation for neural network pruning," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 11264–11272.
[30] L. Zeng and X. Tian, "Accelerating convolutional neural networks by removing interspatial and interkernel redundancies," IEEE Transactions on Cybernetics, vol. 50, no. 2, pp. 452–464, 2018.
[31] A. Polyak and L. Wolf, "Channel-level acceleration of deep face representations," IEEE Access, vol. 3, pp. 2163–2175, 2015.
[32] Z. Zheng, Z. Li, A. Nagar, and K. Park, "Compact deep neural networks for device-based image classification," in IEEE International Conference on Multimedia & Expo Workshops, 2015, pp. 1–6.
[33] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM Journal on Emerging Technologies in Computing Systems, vol. 13, no. 3, p. 32, 2017.
[34] Y. Zhou, Y. Zhang, Y. Wang, and Q. Tian, "Accelerate CNN via recursive Bayesian pruning," in International Conference on Computer Vision, 2018.
[35] Y. Zhou, G. G. Yen, and Z. Yi, "A knee-guided evolutionary algorithm for compressing deep neural networks," IEEE Transactions on Cybernetics, 2019.
[36] P. T. Fletcher, S. Venkatasubramanian, and S. Joshi, "Robust statistics on Riemannian manifolds via the geometric median," in International Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in International Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in International Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[39] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015.
[40] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct neural architecture search on target task and hardware," in International Conference on Learning Representations, 2019.
[41] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia et al., "ChamNet: Towards efficient network design through platform-aware model adaptation," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 11398–11407.
[42] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in European Conference on Computer Vision, 2018, pp. 19–34.
[43] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," in International Conference on Machine Learning, 2018.
[44] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 2820–2828.
[45] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, "FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 10734–10742.
[46] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in International Conference on Learning Representations, 2017.
[47] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Neural Information Processing Systems, 2014, pp. 1269–1277.
[48] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning," in Neural Information Processing Systems, 2015, pp. 3088–3096.
[49] X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2015.
[50] H. Huang and H. Yu, "LTNN: A layerwise tensorized compression of multilayer neural network," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 5, pp. 1497–1511, 2018.
[51] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in International Conference on Machine Learning, 2015, pp. 2285–2294.
[52] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[53] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision, 2016, pp. 525–542.
[54] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in International Conference on Computer Vision and Pattern Recognition, 2016, pp. 4820–4828.
[55] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, "Quantized CNN: A unified approach to accelerate and compress convolutional networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4730–4743, 2017.
[56] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in International Conference on Machine Learning, 2010, pp. 807–814.
[57] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in International Conference on Machine Learning, 2013.
[58] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in International Conference on Learning Representations, 2016.
[59] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in International Conference on Computer Vision, 2015, pp. 1026–1034.
[60] Y. Li, Z. Kuang, Y. Chen, and W. Zhang, "Data-driven neuron allocation for scale aggregation networks," in International Conference on Computer Vision and Pattern Recognition, 2019, pp. 11526–11534.
[61] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[62] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[63] B. Steiner, Z. DeVito, S. Chintala, S. Gross, A. Paszke, F. Massa, A. Lerer, G. Chanan, Z. Lin, E. Yang et al., "PyTorch: An imperative style, high-performance deep learning library," in Neural Information Processing Systems, 2019.
[64] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[65] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision, 2016, pp. 630–645.
[66] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in International Conference on Computer Vision, 2017, pp. 5058–5066.
[67] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," in International Conference on Learning Representations, 2019.
[68] S. Lin, R. Ji, Y. Li, C. Deng, and X. Li, "Toward compact ConvNets via structure-sparsity regularized filter pruning," IEEE Transactions on Neural Networks and Learning Systems, 2019.