What is the State of Neural Network Pruning?

Davis Blalock*¹   Jose Javier Gonzalez Ortiz*¹   Jonathan Frankle¹   John Guttag¹

Abstract
Neural network pruning—the task of reducing the size of a network by removing parameters—has been the
subject of a great deal of work in recent years. We provide a meta-analysis of the literature, including an overview
of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers
and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers
from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to
compare pruning techniques to one another or determine how much progress the field has made over the past
three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and
introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods. We
use ShrinkBench to compare various pruning techniques and show that its comprehensive evaluation can prevent
common pitfalls when comparing pruning methods.

1 Introduction

Much of the progress in machine learning in the past decade has been a result of deep neural networks. Many of these networks, particularly those that perform the best (Huang et al., 2018), require enormous amounts of computation and memory. These requirements not only increase infrastructure costs, but also make deployment of networks to resource-constrained environments such as mobile phones or smart devices challenging (Han et al., 2015; Sze et al., 2017; Yang et al., 2017).

One popular approach for reducing these resource requirements at test time is neural network pruning, which entails systematically removing parameters from an existing network. Typically, the initial network is large and accurate, and the goal is to produce a smaller network with similar accuracy. Pruning has been used since the late 1980s (Janowsky, 1989; Mozer & Smolensky, 1989a;b; Karnin, 1990), but has seen an explosion of interest in the past decade thanks to the rise of deep neural networks.

For this study, we surveyed 81 recent papers on pruning in the hopes of extracting practical lessons for the broader community. For example: which technique achieves the best accuracy/efficiency tradeoff? Are there strategies that work best on specific architectures or datasets? Which high-level design choices are most effective?

There are indeed several consistent results: pruning parameters based on their magnitudes substantially compresses networks without reducing accuracy, and many pruning methods outperform random pruning. However, our central finding is that the state of the literature is such that our motivating questions are impossible to answer. Few papers compare to one another, and methodologies are so inconsistent between papers that we could not make these comparisons ourselves. For example, a quarter of papers compare to no other pruning method, half of papers compare to at most one other method, and dozens of methods have never been compared to by any subsequent work. In addition, no dataset/network pair appears in even a third of papers, evaluation metrics differ widely, and hyperparameters and other confounders vary or are left unspecified.

Most of these issues stem from the absence of standard datasets, networks, metrics, and experimental practices. To help enable more comparable pruning research, we identify specific impediments and pitfalls, recommend best practices, and introduce ShrinkBench, a library for standardized evaluation of pruning. ShrinkBench makes it easy to adhere to the best practices we identify, largely by providing a standardized collection of pruning primitives, models, datasets, and training routines.

Our contributions are as follows:

1. A meta-analysis of the neural network pruning literature based on comprehensively aggregating reported results from 81 papers.
2. A catalog of problems in the literature and best practices for avoiding them. These insights derive from analyzing existing work and pruning hundreds of models.
3. ShrinkBench, an open-source library for evaluating neural network pruning methods, available at https://github.com/jjgo/shrinkbench.

*Equal contribution. ¹MIT CSAIL, Cambridge, MA, USA. Correspondence to: Davis Blalock <[email protected]>.

Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).

2 Overview of Pruning

Before proceeding, we first offer some background on neural network pruning and a high-level overview of how existing pruning methods typically work.

2.1 Definitions

We define a neural network architecture as a function family f(x; ·). The architecture consists of the configuration of the network's parameters and the sets of operations it uses to produce outputs from inputs, including the arrangement of parameters into convolutions, activation functions, pooling, batch normalization, etc. Example architectures include AlexNet and ResNet-56. We define a neural network model as a particular parameterization of an architecture, i.e., f(x; W) for specific parameters W. Neural network pruning entails taking as input a model f(x; W) and producing a new model f(x; M ⊙ W′). Here W′ is a set of parameters that may be different from W, M ∈ {0, 1}^|W′| is a binary mask that fixes certain parameters to 0, and ⊙ is the elementwise product operator. In practice, rather than using an explicit mask, pruned parameters of W are fixed to zero or removed entirely.

Algorithm 1 Pruning and Fine-Tuning
Input: N, the number of iterations of pruning, and
       X, the dataset on which to train and fine-tune
1: W ← initialize()
2: W ← trainToConvergence(f(X; W))
3: M ← 1^|W|
4: for i in 1 to N do
5:    M ← prune(M, score(W))
6:    W ← fineTune(f(X; M ⊙ W))
7: end for
8: return M, W

2.2 High-Level Algorithm

There are many methods of producing a pruned model f(x; M ⊙ W′) from an initially untrained model f(x; W0), where W0 is sampled from an initialization distribution D. Nearly all neural network pruning strategies in our survey derive from Algorithm 1 (Han et al., 2015). In this algorithm, the network is first trained to convergence. Afterwards, each parameter or structural element in the network is issued a score, and the network is pruned based on these scores. Pruning reduces the accuracy of the network, so it is trained further (known as fine-tuning) to recover. The process of pruning and fine-tuning is often iterated several times, gradually reducing the network's size.

Many papers propose slight variations of this algorithm. For example, some papers prune periodically during training (Gale et al., 2019) or even at initialization (Lee et al., 2019b). Others modify the network to explicitly include additional parameters that encourage sparsity and serve as a basis for scoring the network after training (Molchanov et al., 2017).
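To make Algorithm 1 concrete, the following is a minimal PyTorch-style sketch of the prune-then-fine-tune loop, using magnitude-based scoring applied per tensor for simplicity. The training and fine-tuning routines are supplied by the caller; they, along with the helper names, are illustrative assumptions rather than code from any particular paper.

    import torch

    def prune_step(weights, mask, fraction):
        # Zero out the lowest-magnitude `fraction` of the still-unpruned entries of one tensor.
        scores = weights.detach().abs()
        scores[mask == 0] = float('inf')      # never revive already-pruned weights
        k = int(fraction * int(mask.sum()))
        if k > 0:
            threshold = torch.kthvalue(scores.flatten(), k).values
            mask[scores <= threshold] = 0
        return mask

    def prune_and_fine_tune(model, num_iters, fraction, train_fn, fine_tune_fn):
        train_fn(model)                                                        # line 2 of Algorithm 1
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}   # line 3
        for _ in range(num_iters):                                             # lines 4-7
            for name, p in model.named_parameters():
                masks[name] = prune_step(p, masks[name], fraction)             # line 5
                p.data.mul_(masks[name])                                       # apply W <- M (elementwise) W
            fine_tune_fn(model, masks)    # line 6; fine-tuning must keep masked weights at zero
        return masks, model                                                    # line 8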
2.3 Differences Between Pruning Methods

Within the framework of Algorithm 1, pruning methods vary primarily in their choices regarding sparsity structure, scoring, scheduling, and fine-tuning.

Structure. Some methods prune individual parameters (unstructured pruning). Doing so produces a sparse neural network, which—although smaller in terms of parameter count—may not be arranged in a fashion conducive to speedups using modern libraries and hardware. Other methods consider parameters in groups (structured pruning), removing entire neurons, filters, or channels to exploit hardware and software optimized for dense computation (Li et al., 2016; He et al., 2017).

Scoring. It is common to score parameters based on their absolute values, trained importance coefficients, or contributions to network activations or gradients. Some pruning methods compare scores locally, pruning a fraction of the parameters with the lowest scores within each structural subcomponent of the network (e.g., layers) (Han et al., 2015). Others consider scores globally, comparing scores to one another irrespective of the part of the network in which the parameter resides (Lee et al., 2019b; Frankle & Carbin, 2019).

Scheduling. Pruning methods differ in the amount of the network to prune at each step. Some methods prune all desired weights at once in a single step (Liu et al., 2019). Others prune a fixed fraction of the network iteratively over several steps (Han et al., 2015) or vary the rate of pruning according to a more complex function (Gale et al., 2019).
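To make the iterative schedule concrete: if, for example, each step prunes a fixed fraction f of the weights remaining at that step, then after N steps a fraction (1 − f)^N of the original weights remains. Five steps at f = 0.2 leave 0.8^5 ≈ 33% of the weights, i.e., roughly a 3× compression ratio (the numbers here are purely illustrative).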
Fine-tuning. For methods that involve fine-tuning, it is most common to continue to train the network using the trained weights from before pruning. Alternative proposals include rewinding the network to an earlier state (Frankle et al., 2019) and reinitializing the network entirely (Liu et al., 2019).

2.4 Evaluating Pruning

Pruning can accomplish many different goals, including reducing the storage footprint of the neural network, the computational cost of inference, the energy requirements of inference, etc. Each of these goals favors different design choices and requires different evaluation metrics. For example, when reducing the storage footprint of the network, all parameters can be treated equally, meaning one should evaluate the overall compression ratio achieved by pruning.

However, when reducing the computational cost of inference, different parameters may have different impacts. For instance, in convolutional layers, filters applied to spatially larger inputs are associated with more computation than those applied to smaller inputs.

Regardless of the goal, pruning imposes a tradeoff between model efficiency and quality, with pruning increasing the former while (typically) decreasing the latter. This means that a pruning method is best characterized not by a single model it has pruned, but by a family of models corresponding to different points on the efficiency-quality curve. To quantify efficiency, most papers report at least one of two metrics. The first is the number of multiply-adds (often referred to as FLOPs) required to perform inference with the pruned network. The second is the fraction of parameters pruned. To measure quality, nearly all papers report changes in Top-1 or Top-5 image classification accuracy.
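Below is a simple sketch of how the parameter-based metric can be computed for a pruned PyTorch model, assuming pruned weights are stored as exact zeros. FLOP counts, by contrast, depend on layer types and input shapes, so they must be computed per architecture and are not shown here.

    def parameter_metrics(model):
        # Returns (fraction of parameters pruned, compression ratio) for a model
        # whose pruned weights have been set to exactly zero.
        total = sum(p.numel() for p in model.parameters())
        nonzero = sum(int((p != 0).sum()) for p in model.parameters())
        return 1.0 - nonzero / total, total / nonzero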
As others have noted (Lebedev et al., 2014; Figurnov et al., 2016; Louizos et al., 2017; Yang et al., 2017; Han et al., 2015; Kim et al., 2015; Wen et al., 2016; Luo et al., 2017; He et al., 2018b), these metrics are far from perfect. Parameter and FLOP counts are a loose proxy for real-world latency, throughput, memory usage, and power consumption. Similarly, image classification is only one of the countless tasks to which neural networks have been applied. However, because the overwhelming majority of papers in our corpus focus on these metrics, our meta-analysis necessarily does as well.

3 Lessons from the Literature

After aggregating results from a corpus of 81 papers, we identified a number of consistent findings. In this section, we provide an overview of our corpus and then discuss these findings.

3.1 Papers Used in Our Analysis

Our corpus consists of 79 pruning papers published since 2010 and two classic papers (LeCun et al., 1990; Hassibi et al., 1993) that have been compared to by a number of recent methods. We selected these papers by identifying popular papers in the literature and what cites them, systematically searching through conference proceedings, and tracing the directed graph of comparisons between pruning papers. This last procedure results in the property that, barring oversights on our part, there is no pruning paper in our corpus that compares to any pruning paper outside of our corpus. Additional details about our corpus and its construction can be found in Appendix A.

3.2 How Effective is Pruning?

One of the clearest findings about pruning is that it works. More precisely, there are various methods that can significantly compress models with little or no loss of accuracy. In fact, for small amounts of compression, pruning can sometimes increase accuracy (Han et al., 2015; Suzuki et al., 2018). This basic finding has been replicated in a large fraction of the papers in our corpus.

Along the same lines, it has been repeatedly shown that, at least for large amounts of pruning, many pruning methods outperform random pruning (Yu et al., 2018; Gale et al., 2019; Frankle et al., 2019; Mariet & Sra, 2015; Suau et al., 2018; He et al., 2017). Interestingly, this does not always hold for small amounts of pruning (Morcos et al., 2019). Similarly, pruning all layers uniformly tends to perform worse than intelligently allocating parameters to different layers (Gale et al., 2019; Han et al., 2015; Li et al., 2016; Molchanov et al., 2016; Luo et al., 2017) or pruning globally (Lee et al., 2019b; Frankle & Carbin, 2019). Lastly, when holding the number of fine-tuning iterations constant, many methods produce pruned models that outperform retraining from scratch with the same sparsity pattern (Zhang et al., 2015; Yu et al., 2018; Louizos et al., 2017; He et al., 2017; Luo et al., 2017; Frankle & Carbin, 2019), at least with a large enough amount of pruning (Suau et al., 2018). Retraining from scratch in this context means training a fresh, randomly-initialized model with all weights clamped to zero throughout training, except those that are nonzero in the pruned model.

Another consistent finding is that sparse models tend to outperform dense ones for a fixed number of parameters. Lee et al. (2019a) show that increasing the nominal size of ResNet-20 on CIFAR-10 while sparsifying to hold the number of parameters constant decreases the error rate. Kalchbrenner et al. (2018) obtain a similar result for audio synthesis, as do Gray et al. (2017) for a variety of additional tasks across various domains. Perhaps most compelling of all are the many results, including in Figure 1, showing that pruned models can obtain higher accuracies than the original models from which they are derived. This demonstrates that sparse models can not only outperform dense counterparts with the same number of parameters, but sometimes dense models with even more parameters.

3.3 Pruning vs Architecture Changes

One current unknown about pruning is how effective it tends to be relative to simply using a more efficient architecture. These options are not mutually exclusive, but it may be useful in guiding one's research or development efforts to know which choice is likely to have the larger impact. Along similar lines, it is unclear how pruned models from different architectures compare to one another—i.e., to what extent does pruning offer similar benefits across architectures?

To address these questions, we plotted the reported accuracies and compression/speedup levels of pruned models on ImageNet alongside the same metrics for different architectures with no pruning (Figure 1).¹ We plot results within a family of models as a single curve.²

Figure 1: Size and speed vs accuracy tradeoffs for different pruning methods and families of architectures. Pruned models sometimes outperform the original architecture, but rarely outperform a better architecture. (The panels plot Top-1 and Top-5 accuracy against number of parameters and number of FLOPs for MobileNet-v2, ResNet, VGG, and EfficientNet, with and without pruning.)

Figure 1 suggests several conclusions. First, it reinforces the conclusion that pruning can improve the time or space vs accuracy tradeoff of a given architecture, sometimes even increasing the accuracy. Second, it suggests that pruning generally does not help as much as switching to a better architecture. Finally, it suggests that pruning is more effective for architectures that are less efficient to begin with.

4 Missing Controlled Comparisons

While there do appear to be a few general and consistent findings in the pruning literature (see the previous section), by far the clearest takeaway is that pruning papers rarely make direct and controlled comparisons to existing methods. This lack of comparisons stems largely from a lack of experimental standardization and the resulting fragmentation in reported results. This fragmentation makes it difficult for even the most committed authors to compare to more than a few existing methods.

4.1 Omission of Comparison

Many papers claim to advance the state of the art, but don't compare to other methods—including many published ones—that make the same claim.

Ignoring Pre-2010s Methods. There was already a rich body of work on neural network pruning by the mid 1990s (see, e.g., Reed's survey (Reed, 1993)), which has been almost completely ignored except for LeCun's Optimal Brain Damage (LeCun et al., 1990) and Hassibi's Optimal Brain Surgeon (Hassibi et al., 1993). Indeed, multiple authors have rediscovered existing methods or aspects thereof, with Han et al. (2015) reintroducing the magnitude-based pruning of Janowsky (1989), Lee et al. (2019b) reintroducing the saliency heuristic of Mozer & Smolensky (1989a), and He et al. (2018a) reintroducing the practice of "reviving" previously pruned weights described in Tresp et al. (1997).

Ignoring Recent Methods. Even when considering only post-2010 approaches, there are still virtually no methods that have been shown to outperform all existing "state-of-the-art" methods. This follows from the fact, depicted in the top plot of Figure 2, that there are dozens of modern papers—including many affirmed through peer review—that have never been compared to by any later study.

A related problem is that papers tend to compare to few existing methods. In the lower plot of Figure 2, we see that more than a fourth of our corpus does not compare to any previously proposed pruning method, and another fourth compares to only one. Nearly all papers compare to three or fewer. This might be adequate if there were a clear progression of methods with one or two "best" methods at any given time, but this is not the case.

4.2 Dataset and Architecture Fragmentation

Among 81 papers, we found results using 49 datasets, 132 architectures, and 195 (dataset, architecture) combinations. As shown in Table 1, even the most common combination of dataset and architecture—VGG-16 on ImageNet³ (Deng et al., 2009)—is used in only 22 out of 81 papers. Moreover, three of the top six most common combinations involve MNIST (LeCun et al., 1998a).

¹ Since many pruning papers report only change in accuracy or amount of pruning, without giving baseline numbers, we normalize all pruning results to have accuracies and model sizes/FLOPs as if they had begun with the same model. Concretely, this means multiplying the reported fraction of pruned size/FLOPs by a standardized initial value. This value is set to the median initial size or number of FLOPs reported for that architecture across all papers. This normalization scheme is not perfect, but does help control for different methods beginning with different baseline accuracies.
² The EfficientNet family is given explicitly in the original paper (Tan & Le, 2019), the ResNet family consists of ResNet-18, ResNet-34, ResNet-50, etc., and the VGG family consists of VGG-{11, 13, 16, 19}. There are no pruned EfficientNets since EfficientNet was published too recently. Results for non-pruned models are taken from (Tan & Le, 2019) and (Bianco et al., 2018).
³ We adopt the common practice of referring to the ILSVRC2012 training and validation sets as "ImageNet."
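To make the normalization described in footnote 1 concrete, here is a minimal illustration; the numbers are hypothetical and not taken from any paper:

    reported_fraction_of_baseline_flops = 0.25   # a paper reports pruning to 25% of its (unreported) baseline FLOPs
    median_baseline_flops = 4.0e9                # hypothetical median baseline FLOPs reported for that architecture
    normalized_flops = reported_fraction_of_baseline_flops * median_baseline_flops   # 1.0e9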

As Gale et al. (2019) and others have argued, using larger datasets and models is essential when assessing how well a method works for real-world networks. MNIST results may be particularly unlikely to generalize, since this dataset differs significantly from other popular datasets for image classification. In particular, its images are grayscale, composed mostly of zeros, and possible to classify with over 99% accuracy using simple models (LeCun et al., 1998b).

Figure 2: Reported comparisons between papers. Top: the number of other papers that compare to a given paper. Bottom: the number of existing methods a given paper compares to.

(Dataset, Architecture) Pair            Number of Papers Using Pair
ImageNet     VGG-16                     22
ImageNet     ResNet-50                  15
MNIST        LeNet-5-Caffe              14
CIFAR-10     ResNet-56                  14
MNIST        LeNet-300-100              12
MNIST        LeNet-5                    11
ImageNet     CaffeNet                   10
CIFAR-10     CIFAR-VGG (Torch)           8
ImageNet     AlexNet                     8
ImageNet     ResNet-18                   6
ImageNet     ResNet-34                   6
CIFAR-10     ResNet-110                  5
CIFAR-10     PreResNet-164               4
CIFAR-10     ResNet-32                   4

Table 1: All combinations of dataset and architecture used in at least 4 out of 81 papers.

4.3 Metrics Fragmentation

As depicted in Figure 3, papers report a wide variety of metrics and operating points, making it difficult to compare results. Each column in this figure is one (dataset, architecture) combination taken from the four most common combinations⁴, excluding results on MNIST. Each row is one pair of metrics. Each curve is the efficiency vs accuracy tradeoff obtained by one method.⁵ Methods are color-coded by year.

It is hard to identify any consistent trends in these plots, aside from the existence of a tradeoff between efficiency and accuracy. A given method is only present in a small subset of plots. Methods from later years do not consistently outperform methods from earlier years. Methods within a plot are often incomparable because they report results at different points on the x-axis. Even when methods are nearby on the x-axis, it is not clear whether one meaningfully outperforms another since neither reports a standard deviation or other measure of central tendency. Finally, most papers in our corpus do not report any results with any of these common configurations.

⁴ We combined the results for AlexNet and CaffeNet, which is a slightly modified version of AlexNet (caf, 2016), since many authors refer to the latter as "AlexNet," and it is often unclear which model was used.
⁵ Since what counts as one method can be unclear, we consider all results from one paper to be one method except when two or more named methods within the paper report using at least one identical x-coordinate (i.e., when the paper's results can't be plotted as one curve).

4.4 Incomplete Characterization of Results

If all papers reported a wide range of points in their tradeoff curves across a large set of models and datasets, there might be some number of direct comparisons possible between any given pair of methods. As we see in the upper half of Figure 4, however, most papers use at most three (dataset, architecture) pairs; and as we see in the lower half, they use at most three—and often just one—point to characterize each curve. Combined with the fragmentation in experimental choices, this means that different methods' results are rarely directly comparable. Note that the lower half restricts results to the four most common (dataset, architecture) pairs.

4.5 Confounding Variables

Even when comparisons include the same datasets, models, metrics, and operating points, other confounding variables still make meaningful comparisons difficult. Some variables of particular interest include:
• Accuracy and efficiency of the initial model
• Data augmentation and preprocessing
• Random variations in initialization, training, and fine-tuning. This includes choice of optimizer, hyperparameters, and learning rate schedule.
• Pruning and fine-tuning schedule

Figure 3: Fragmentation of results. Shown are all self-reported results on the most common (dataset, architecture) combinations: VGG-16 on ImageNet, AlexNet/CaffeNet on ImageNet, ResNet-50 on ImageNet, and ResNet-56 on CIFAR-10. Each column is one combination, each row shares an accuracy metric (y-axis: change in Top-1 or Top-5 accuracy), and pairs of rows share a compression metric (x-axis: log2 of compression ratio or theoretical speedup). Up and to the right is always better. Standard deviations are shown for He 2018 on CIFAR-10, which is the only result that provides any measure of central tendency. As suggested by the legend, only 37 out of the 81 papers in our corpus report any results using any of these configurations.

• Deep learning library. Different libraries are known to yield different accuracies for the same architecture and dataset (Northcutt, 2019; Nola, 2016) and may have subtly different behaviors (Vryniotis, 2018).
• Subtle differences in code and environment that may not be easily attributable to any of the above variations (Crall, 2018; Jogeshwar, 2017; unr, 2017).

In general, it is not clear that any paper can succeed in accounting for all of these confounders unless that paper has both used the same code as the methods to which it compares and reports enough measurements to average out random variations. This is exceptionally rare, with Gale et al. (2019) and Liu et al. (2019) being arguably the only examples. Moreover, neither of these papers introduce novel pruning methods per se but are instead inquiries into the efficacy of existing methods.

Many papers attempt to account for subsets of these confounding variables.

A near universal practice in this regard is reporting change in accuracy relative to the original model, in addition to or instead of raw accuracy. This helps to control for the accuracy of the initial model. However, as we demonstrate in Section 7, this is not sufficient to remove the initial model as a confounder. Certain initial models can be pruned more or less efficiently, in terms of the accuracy vs compression tradeoff. This holds true even with identical pruning methods and all other variables held constant.

There are at least two more empirical reasons to believe that confounding variables can have a significant impact. First, as one can observe in Figure 3, methods often introduce changes in accuracy of much less than 1% at reported operating points. This means that, even if confounders have only a tiny impact on accuracy, they can still have a large impact on which method appears better.

Second, as shown in Figure 5, existing results demonstrate that different training and fine-tuning settings can yield nearly as much variability as different methods. Specifically, consider 1) the variability introduced by different fine-tuning methods for unstructured magnitude-based pruning (Figure 5, upper plot) and 2) the variability introduced by entirely different pruning methods (Figure 5, lower plot). The variability between fine-tuning methods is nearly as large as the variability between pruning methods.

Figure 4: Number of results reported by each paper, excluding MNIST. Top: most papers report on three or fewer (dataset, architecture) pairs. Bottom: for each pair used, most papers characterize their tradeoff between amount of pruning and accuracy using a single point in the efficiency vs accuracy curve. In both plots, the pattern holds even for peer-reviewed papers.

Figure 5: Pruning ResNet-50 on ImageNet. Methods in the upper plot all prune weights with the smallest magnitudes, but differ in implementation, pruning schedule, and fine-tuning. The variation caused by these variables is similar to the variation across different pruning methods, whose results are shown in the lower plot. All results are taken from the original papers.

5 Further Barriers to Comparison

In the previous section, we discussed the fragmentation of datasets, models, metrics, operating points, and experimental details, and how this fragmentation makes evaluating the efficacy of individual pruning methods difficult. In this section, we argue that there are additional barriers to comparing methods that stem from common practices in how methods and results are presented.

5.1 Architecture Ambiguity

It is often difficult, or even impossible, to identify the exact architecture that authors used. Perhaps the most prevalent example of this is when authors report using some sort of ResNet (He et al., 2016a;b). Because there are two different variations of ResNets, introduced in these two papers, saying that one used a "ResNet-50" is insufficient to identify a particular architecture. Some authors do appear to deliberately point out the type of ResNet they use (e.g., (Liu et al., 2017; Dong et al., 2017)).

However, given that few papers even hint at the possibility of confusion, it seems unlikely that all authors are even aware of the ambiguity, let alone that they have cited the corresponding paper in all cases.

Perhaps the greatest confusion is over VGG networks (Simonyan & Zisserman, 2014). Many papers describe experimenting on "VGG-16," "VGG," or "VGGNet," suggesting a standard and well-known architecture. In many cases, what is actually used is a custom variation of some VGG model, with removed fully-connected layers (Changpinyo et al., 2017; Luo et al., 2017), smaller fully-connected layers (Lee et al., 2019b), or added dropout or batchnorm (Liu et al., 2017; Lee et al., 2019b; Peng et al., 2018; Molchanov et al., 2017; Ding et al., 2018; Suau et al., 2018).

In some cases, papers simply fail to make clear what model they used (even for non-VGG architectures). For example, one paper just states that their segmentation model "is composed from an inception-like network branch and a DenseNet network branch." Another paper attributes their VGGNet to (Parkhi et al., 2015), which mentions three VGG networks. Liu et al. (2019) and Frankle & Carbin (2019) have circular references to one another that can no longer be resolved because of simultaneous revisions. One paper mentions using a "VGG-S" from the Caffe Model Zoo, but as of this writing, no model with this name exists there. Perhaps the most confusing case is the Lenet-5-Caffe reported in one 2017 paper. The authors are to be commended for explicitly stating not only that they use Lenet-5-Caffe, but their exact architecture. However, they describe an architecture with an 800-unit fully-connected layer, while examination of both the Caffe .prototxt files (Jia et al., 2015a;b) and associated blog post (Jia et al., 2016) indicates that no such layer exists in Lenet-5-Caffe.

5.2 Metrics Ambiguity

It can also be difficult to know what the reported metrics mean. For example, many papers include a metric along the lines of "Pruned%". In some cases, this means the fraction of the parameters or FLOPs remaining (Suau et al., 2018). In other cases, it means the fraction of parameters or FLOPs removed (Han et al., 2015; Lebedev & Lempitsky, 2016; Yao et al., 2018). There is also widespread misuse of the term "compression ratio," which the compression literature has long used to mean (original size) / (compressed size) (Siedelmann et al., 2015; Zukowski et al., 2006; Zhao et al., 2015; Lindstrom, 2014; Ratanaworabhan et al., 2006; Blalock et al., 2018), but which many pruning authors define (usually without making the formula explicit) as 1 − (compressed size) / (original size).

Reported "speedup" values present similar challenges. These values are sometimes wall time, sometimes original number of FLOPs divided by pruned number of FLOPs, sometimes a more complex formula relating these two quantities (Dong et al., 2017; He et al., 2018a), and sometimes never made clear. Even when reporting FLOPs, which is nominally a consistent metric, different authors measure it differently (e.g., (Molchanov et al., 2016) vs (Wang & Cheng, 2016)), though most often papers entirely omit their formula for computing FLOPs. We found up to a factor of four variation in the reported FLOPs of different papers for the same architecture and dataset, with (Yang et al., 2017) reporting 371 MFLOPs for AlexNet on ImageNet, (Choi et al., 2019) reporting 724 MFLOPs, and (Han et al., 2015) reporting 1500 MFLOPs.
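As a concrete illustration of how much these conventions can diverge, consider a network pruned from 10 million to 2.5 million nonzero parameters (numbers chosen purely for illustration):

    original_params, remaining_params = 10_000_000, 2_500_000
    ratio_compression_literature = original_params / remaining_params              # 4.0, i.e., "4x"
    ratio_some_pruning_papers    = 1 - remaining_params / original_params          # 0.75
    pruned_pct_if_removed        = 100 * (1 - remaining_params / original_params)  # 75%
    pruned_pct_if_remaining      = 100 * remaining_params / original_params        # 25%
    # The same model can thus be reported as "4x", "0.75", "75%", or "25%",
    # depending on which unstated convention a paper uses.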

6 Summary and Recommendations

In the previous sections, we have argued that existing work tends to
• make it difficult to identify the exact experimental setup and metrics,
• use too few (dataset, architecture) combinations,
• report too few points in the tradeoff curve for any given combination, and no measures of central tendency,
• omit comparison to many methods that might be state-of-the-art, and
• fail to control for confounding variables.

These problems often make it difficult or impossible to assess the relative efficacy of different pruning methods. To enable direct comparison between methods in the future, we suggest the following practices:
• Identify the exact sets of architectures, datasets, and metrics used, ideally in a structured way that is not scattered throughout the results section.
• Use at least three (dataset, architecture) pairs, including modern, large-scale ones. MNIST and toy models do not count. AlexNet, CaffeNet, and Lenet-5 are no longer modern architectures.
• For any given pruned model, report both compression ratio and theoretical speedup. Compression ratio is defined as the original size divided by the new size. Theoretical speedup is defined as the original number of multiply-adds divided by the new number. Note that there is no reason to report only one of these metrics.
• For ImageNet and other many-class datasets, report both Top-1 and Top-5 accuracy. There is again no reason to report only one of these.
• Whatever metrics one reports for a given pruned model, also report these metrics for an appropriate control (usually the original model before pruning).
• Plot the tradeoff curve for a given dataset and architecture, alongside the curves for competing methods.
• When plotting tradeoff curves, use at least 5 operating points spanning a range of compression ratios. The set of ratios {2, 4, 8, 16, 32} is a good choice.
• Report and plot means and sample standard deviations, instead of one-off measurements, whenever feasible.
• Ensure that all methods being compared use identical libraries, data loading, and other code to the greatest extent possible.

We also recommend that reviewers demand a much greater level of rigor when evaluating papers that claim to offer a better method of pruning neural networks.

7 ShrinkBench

7.1 Overview of ShrinkBench

To make it as easy as possible for researchers to put our suggestions into practice, we have created an open-source library for pruning called ShrinkBench. ShrinkBench provides standardized and extensible functionality for training, pruning, fine-tuning, computing metrics, and plotting, all using a standardized set of pretrained models and datasets. ShrinkBench is based on PyTorch (Paszke et al., 2017) and is designed to allow easy evaluation of methods with arbitrary scoring functions, allocation of pruning across layers, and sparsity structures. In particular, given a callback defining how to compute masks for a model's parameter tensors at a given iteration, ShrinkBench will automatically apply the pruning, update the network according to a standard training or fine-tuning setup, and compute metrics across many models, datasets, random seeds, and levels of pruning. We defer discussion of ShrinkBench's implementation and API to the project's documentation.

7.2 Baselines

We used ShrinkBench to implement several existing pruning heuristics, both as examples of how to use our library and as baselines that new methods can compare to:
• Global Magnitude Pruning - prunes the weights with the lowest absolute value anywhere in the network.
• Layerwise Magnitude Pruning - for each layer, prunes the weights with the lowest absolute value.
• Global Gradient Magnitude Pruning - prunes the weights with the lowest absolute value of (weight × gradient), evaluated on a batch of inputs.
• Layerwise Gradient Magnitude Pruning - for each layer, prunes the weights with the lowest absolute value of (weight × gradient), evaluated on a batch of inputs.
• Random Pruning - prunes each weight independently with probability equal to the fraction of the network to be pruned.

Magnitude-based approaches are common baselines in the literature and have been shown to be competitive with more complex methods (Han et al., 2015; 2016; Gale et al., 2019; Frankle et al., 2019). Gradient-based methods are less common, but are simple to implement and have recently gained popularity (Lee et al., 2019b;a; Yu et al., 2018). Random pruning is a common straw man that can serve as a useful debugging tool. Note that these baselines are not reproductions of any of these methods, but merely inspired by their pruning heuristics.
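The following is a minimal PyTorch-style sketch of the two magnitude-based baselines above, intended only to illustrate where the threshold is computed (per layer versus across the whole network); it is not a reproduction of ShrinkBench's implementation. The gradient-based variants are analogous but score each parameter by |weight × gradient| computed on a batch of inputs.

    import torch

    def layerwise_magnitude_masks(model, fraction):
        # One threshold per tensor: prune the smallest-magnitude weights within each layer.
        masks = {}
        for name, p in model.named_parameters():
            k = max(int(fraction * p.numel()), 1)
            threshold = torch.kthvalue(p.detach().abs().flatten(), k).values
            masks[name] = (p.detach().abs() > threshold).float()
        return masks

    def global_magnitude_masks(model, fraction):
        # One threshold for the whole network: prune the smallest-magnitude weights anywhere.
        scores = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
        k = max(int(fraction * scores.numel()), 1)
        threshold = torch.kthvalue(scores, k).values
        return {name: (p.detach().abs() > threshold).float()
                for name, p in model.named_parameters()}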
7.3 Avoiding Pruning Pitfalls with ShrinkBench

Using the described baselines, we pruned over 800 networks with varying datasets, networks, compression ratios, initial weights and random seeds. In doing so, we identified various pitfalls associated with experimental practices that are currently common in the literature but are avoided by using ShrinkBench.

We highlight several noteworthy results below. For additional experimental results and details, see Appendix D. Error bars of one standard deviation across three runs are shown for all CIFAR-10 results.

Metrics are not Interchangeable. As discussed previously, it is common practice to report either reduction in the number of parameters or in the number of FLOPs. If these metrics are extremely correlated, reporting only one is sufficient to characterize the efficacy of a pruning method. We found after computing these metrics for the same model under many different settings that reporting one metric is not sufficient. While these metrics are correlated, the correlation is different for each pruning method. Thus, the relative performance of different methods can vary significantly under different metrics (Figure 6).

Figure 6: Top 1 Accuracy for ResNet-18 on ImageNet for several compression ratios and their corresponding theoretical speedups. Global methods give higher accuracy than Layerwise ones for a fixed model size, but the reverse is true for a fixed theoretical speedup.

Results Vary Across Models, Datasets, and Pruning Amounts. Many methods report results on only a small number of datasets, models, amounts of pruning, and random seeds. If the relative performance of different methods tends to be constant across all of these variables, this may not be problematic. However, our results suggest that this performance is not constant.

Figure 7 shows the accuracy for various compression ratios for CIFAR-VGG (Zagoruyko, 2015) and ResNet-56 on CIFAR-10. In general, Global methods are more accurate than Layerwise methods and Magnitude-based methods are more accurate than Gradient-based methods, with random pruning performing worst of all. However, if one were to look only at CIFAR-VGG for compression ratios smaller than 10, one could conclude that Global Gradient outperforms all other methods. Similarly, while Global Gradient consistently outperforms Layerwise Magnitude on CIFAR-VGG, the opposite holds on ResNet-56 (i.e., the orange and green lines switch places).

Moreover, we found that for some settings close to the drop-off point (such as Global Gradient, compression 16), different random seeds yielded significantly different results (0.88 vs 0.61 accuracy) due to the randomness in minibatch selection. This is illustrated by the large vertical error bar in the left subplot.

Figure 7: Top 1 Accuracy on CIFAR-10 for several compression ratios. Global Gradient performs better than Global Magnitude for CIFAR-VGG at low compression ratios, but worse otherwise. Global Gradient is consistently better than Layerwise Magnitude on CIFAR-VGG, but consistently worse on ResNet-56.

Using the Same Initial Model is Essential. As mentioned in Section 4.5, many methods are evaluated using different initial models with the same architecture. To assess whether beginning with a different model can skew the results, we created two different models and evaluated Global vs Layerwise Magnitude pruning on each with all other variables held constant.

To obtain the models, we trained two ResNet-56 networks using Adam until convergence with η = 10^-3 and η = 10^-4. We'll refer to these pretrained weights as Weights A and Weights B, respectively. As shown on the left side of Figure 8, the different methods appear better on different models. With Weights A, the methods yield similar absolute accuracies. With Weights B, however, the Global method is more accurate at higher compression ratios.

Figure 8: Global and Layerwise Magnitude Pruning on two different ResNet-56 models. Even with all other variables held constant, different initial models yield different tradeoff curves. This may cause one method to erroneously appear better than another. Controlling for initial accuracy does not fix this.

We also found that the common practice of examining changes in accuracy is insufficient to correct for initial model as a confounder. Even when reporting changes, one pruning method can artificially appear better than another by virtue of beginning with a different model. We see this on the right side of Figure 8, where Layerwise Magnitude with Weights B appears to outperform Global Magnitude with Weights A, even though the former never outperforms the latter when the initial model is held constant.

8 Conclusion

Considering the enormous interest in neural network pruning over the past decade, it seems natural to ask simple questions about the relative efficacy of different pruning techniques. Although a few basic findings are shared across the literature, missing baselines and inconsistent experimental settings make it impossible to assess the state of the art or confidently compare the dozens of techniques proposed in recent years. After carefully studying the literature and enumerating numerous areas of incomparability and confusion, we suggest concrete remedies in the form of a list of best practices and an open-source library—ShrinkBench—to help future research endeavors produce the kinds of results that will harmonize the literature and make our motivating questions easier to answer. Furthermore, ShrinkBench results on various pruning techniques evidence the need for standardized experiments when evaluating neural network pruning methods.

Acknowledgements

We thank Luigi Celona for providing the data used in (Bianco et al., 2018) and Vivienne Sze for helpful discussion. This research was supported by the Qualcomm Innovation Fellowship, the "la Caixa" Foundation Fellowship, Quanta Computer, and Wistron Corporation.

References

What's the advantage of the reference caffenet in comparison with the alexnet? https://github.com/BVLC/caffe/issues/4202, 5 2016. Accessed: 2019-07-22.
Keras exported model shows very low accuracy in tensorflow serving. https://github.com/keras-team/keras/issues/7848, 9 2017. Accessed: 2019-07-22.
Bianco, S., Cadene, R., Celona, L., and Napoletano, P. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6:64270–64277, 2018.
Blalock, D., Madden, S., and Guttag, J. Sprintz: Time series compression for the internet of things. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(3):93, 2018.
Changpinyo, S., Sandler, M., and Zhmoginov, A. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017.
Choi, Y., El-Khamy, M., and Lee, J. Jointly sparse convolutional neural networks in dual spatial-winograd domains. arXiv preprint arXiv:1902.08192, 2019.
Crall, J. Accuracy of resnet50 is much higher than reported! https://github.com/kuangliu/pytorch-cifar/issues/45, 2018. Accessed: 2019-07-22.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
Ding, X., Ding, G., Han, J., and Tang, S. Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Dong, X., Huang, J., Yang, Y., and Yan, S. More is less: A more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5840–5848, 2017.
Dubey, A., Chatterjee, M., and Ahuja, N. Coreset-based neural network compression. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 454–470, 2018.
Figurnov, M., Ibraimova, A., Vetrov, D. P., and Kohli, P. Perforatedcnns: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pp. 947–955, 2016.
Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. The lottery ticket hypothesis at scale. arXiv preprint arXiv:1903.01611, 2019.
Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks, 2019.
Gray, S., Radford, A., and Kingma, D. P. Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In Bengio, Y. and LeCun, Y. (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1510.00149.
Hassibi, B., Stork, D. G., and Wolff, G. J. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299. IEEE, 1993.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.
He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397, 2017.
He, Y., Kang, G., Dong, X., Fu, Y., and Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI International Joint Conference on Artificial Intelligence, 2018a.
He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018b.

Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.
Huang, Z. and Wang, N. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 304–320, 2018.
Janowsky, S. A. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600–6603, June 1989. ISSN 0556-2791. doi: 10.1103/PhysRevA.39.6600. URL https://link.aps.org/doi/10.1103/PhysRevA.39.6600.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. lenet. https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet.prototxt, 2 2015a.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. lenet-train-test. https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt, 2 2015b.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Training lenet on mnist with caffe. https://caffe.berkeleyvision.org/gathered/examples/mnist.html, 5 2016. Accessed: 2019-07-22.
Jogeshwar, A. Validating resnet50. https://github.com/keras-team/keras/issues/8672, 12 2017. Accessed: 2019-07-22.
Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A. v. d., Dieleman, S., and Kavukcuoglu, K. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.
Karnin, E. D. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2):239–242, 1990.
Kim, Y.-D., Park, E., Yoo, S., Choi, T., Yang, L., and Shin, D. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
Lebedev, V. and Lempitsky, V. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564, 2016.
Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., and Lempitsky, V. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605, 1990.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.
LeCun, Y., Cortes, C., and Burges, C. The mnist database of handwritten digits, 1998b. Accessed: 2019-09-6.
Lee, N., Ajanthan, T., Gould, S., and Torr, P. H. S. A signal propagation perspective for pruning neural networks at initialization. arXiv preprint arXiv:1906.06307, June 2019a. URL http://arxiv.org/abs/1906.06307.
Lee, N., Ajanthan, T., and Torr, P. H. S. Snip: single-shot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019b. URL https://openreview.net/forum?id=B1VZqjAcYX.
Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
Lindstrom, P. Fixed-rate compressed floating-point arrays. IEEE Transactions on Visualization and Computer Graphics, 20(12):2674–2683, 2014.
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744, 2017.
Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJlnB3C5Ym.
Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3288–3298, 2017.

Luo, J.-H., Wu, J., and Lin, W. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066, 2017.
Mariet, Z. and Sra, S. Diversity networks: Neural network compression using determinantal point processes. arXiv preprint arXiv:1511.05077, 2015.
Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2498–2507. JMLR.org, 2017.
Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
Morcos, A. S., Yu, H., Paganini, M., and Tian, Y. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. arXiv preprint arXiv:1906.02773, June 2019. URL http://arxiv.org/abs/1906.02773.
Mozer, M. C. and Smolensky, P. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pp. 107–115, 1989a.
Mozer, M. C. and Smolensky, P. Using relevance to reduce network size automatically. Connection Science, 1(1):3–16, January 1989b. ISSN 0954-0091, 1360-0494. doi: 10.1080/09540098908915626. URL https://www.tandfonline.com/doi/full/10.1080/09540098908915626.
Nola, D. Keras doesn't reproduce caffe example code accuracy. https://github.com/keras-team/keras/issues/4444, 11 2016. Accessed: 2019-07-22.
Northcutt, C. Towards reproducibility: Benchmarking keras and pytorch. https://l7.curtisnorthcutt.com/towards-reproducibility-benchmarking-keras-pytorch, 2 2019. Accessed: 2019-07-22.
Parkhi, O. M., Vedaldi, A., Zisserman, A., et al. Deep face recognition. In BMVC, volume 1, pp. 6, 2015.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
Peng, B., Tan, W., Li, Z., Zhang, S., Xie, D., and Pu, S. Extreme network compression via filter group approximation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–316, 2018.
Ratanaworabhan, P., Ke, J., and Burtscher, M. Fast lossless compression of scientific floating-point data. In Data Compression Conference (DCC'06), pp. 133–142. IEEE, 2006.
Reed, R. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740–747, September 1993. ISSN 10459227. doi: 10.1109/72.248452. URL http://ieeexplore.ieee.org/document/248452/.
Siedelmann, H., Wender, A., and Fuchs, M. High speed lossless image compression. In German Conference on Pattern Recognition, pp. 343–355. Springer, 2015.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Suau, X., Zappella, L., and Apostoloff, N. Network compression using correlation analysis of layer responses. 2018.
Suzuki, T., Abe, H., Murata, T., Horiuchi, S., Ito, K., Wachi, T., Hirai, S., Yukishima, M., and Nishimura, T. Spectral-pruning: Compressing deep neural network via spectral analysis. arXiv preprint arXiv:1808.08558, 2018.
Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039, 2017.
Tan, M. and Le, Q. V. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
Tresp, V., Neuneier, R., and Zimmermann, H.-G. Early brain damage. In Advances in Neural Information Processing Systems, pp. 669–675, 1997.
Vryniotis, V. Change bn layer to use moving mean/var if frozen. https://github.com/keras-team/keras/pull/9965, 4 2018. Accessed: 2019-07-22.
Wang, P. and Cheng, J. Accelerating convolutional neural networks for mobile applications. In Proceedings of the 24th ACM International Conference on Multimedia, pp. 541–545. ACM, 2016.
Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
Yamamoto, K. and Maeno, K. Pcas: Pruning channels with attention statistics. arXiv preprint arXiv:1806.05382, 2018.
Yang, T.-J., Chen, Y.-H., and Sze, V. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687-5695, 2017.

Yao, Z., Cao, S., and Xiao, W. Balanced sparsity for efficient DNN inference on GPU. arXiv preprint arXiv:1811.00206, 2018.

Yu, R., Li, A., Chen, C.-F., Lai, J.-H., Morariu, V. I., Han, X., Gao, M., Lin, C.-Y., and Davis, L. S. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194-9203, 2018.

Zagoruyko, S. 92.45% on CIFAR-10 in Torch. https://torch.ch/blog/2015/07/30/cifar.html, 7 2015. Accessed: 2019-07-22.

Zhang, X., Zou, J., He, K., and Sun, J. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943-1955, 2015.

Zhao, W. X., Zhang, X., Lemire, D., Shan, D., Nie, J.-Y., Yan, H., and Wen, J.-R. A general SIMD-based approach to accelerating compression algorithms. ACM Transactions on Information Systems (TOIS), 33(3):15, 2015.

Zukowski, M., Heman, S., Nes, N., and Boncz, P. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), pp. 59-59. IEEE, 2006.
A Corpus and Data Cleaning

We selected the 81 papers used in our analysis in the following way. First, we conducted an ad hoc literature search, finding widely cited papers introducing pruning methods and identifying other pruning papers that cited them using Google Scholar. We then went through the conference proceedings from the past year's NeurIPS, ICML, CVPR, ECCV, and ICLR and added all relevant papers (though it is possible we missed papers whose titles and abstracts did not seem relevant to pruning). Finally, during the course of cataloging which papers compared to which others, we added to our corpus any pruning paper that at least one existing paper in our corpus purported to compare to. We included both published papers and unpublished ones of reasonable quality (typically on arXiv). Since we make strong claims about the lack of comparisons, we included in our corpus five papers whose methods technically do not meet our definition of pruning but are similar in spirit and compared to by various pruning papers. In short, we included essentially every paper introducing a method of pruning neural networks that we could find, taking care to capture the full directed graph of papers and comparisons between them.

Because different papers report slightly different metrics, particularly with respect to model size, we converted reported results to a standard set of metrics whenever possible. For example, we converted reported Top-1 error rates to Top-1 accuracies, and fractions of parameters pruned to compression ratios (a small sketch of these conversions appears at the end of this appendix). Note that it is not possible to convert between size metrics and speedup metrics, since the amount of computation associated with a given parameter can depend on the layer in which it resides (convolutional filters are reused at many spatial positions). For simplicity and uniformity, we only consider self-reported results except where stated otherwise.

We also did not attempt to capture all reported metrics, but instead focused only on model size reduction and theoretical speedup, since 1) these are by far the most commonly reported metrics, and 2) there is already a dearth of directly comparable numbers even for these common metrics. This is not entirely fair to methods designed to optimize other metrics, such as power consumption (Louizos et al., 2017; Yang et al., 2017; Han et al., 2015; Kim et al., 2015), memory bandwidth usage (Peng et al., 2018; Kim et al., 2015), or fine-tuning time (Dubey et al., 2018; Yamamoto & Maeno, 2018; Huang & Wang, 2018; He et al., 2018a), and we consider this a limitation of our analysis.

Lastly, because our analysis relies on a reading of hundreds of pages of dense technical content, we are confident that we have made some number of isolated errors. We therefore welcome corrections by email and refer the reader to the arXiv version of this paper for the most up-to-date revision.
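To make the metric conversions described above concrete, the following is a small illustrative sketch; the function names are ours and are not part of ShrinkBench.

import math

def top1_accuracy(top1_error: float) -> float:
    """Convert a reported Top-1 error rate (e.g., 0.2375) to a Top-1 accuracy."""
    return 1.0 - top1_error

def compression_ratio(fraction_pruned: float) -> float:
    """Convert a fraction of parameters pruned (e.g., 0.90) to a compression ratio.

    A network with 90% of its parameters removed keeps 10% of them,
    i.e., it has a 10x compression ratio.
    """
    return 1.0 / (1.0 - fraction_pruned)

assert math.isclose(compression_ratio(0.90), 10.0)
assert math.isclose(top1_accuracy(0.2375), 0.7625)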
B Checklist for Evaluating a Pruning Method

For any pruning technique proposed, check if:

• It is contextualized with respect to magnitude pruning, recently published pruning techniques, and pruning techniques proposed prior to the 2010s.

• The pruning algorithm, constituent subroutines (e.g., score, pruning, and fine-tuning functions), and hyperparameters are presented in enough detail for a reader to reimplement and match the results in the paper.

• All claims about the technique are appropriately restricted to only the experiments presented (e.g., CIFAR-10, ResNets, image classification tasks, etc.).

• There is a link to downloadable source code.

For all experiments, check if you include:

• A detailed description of the architecture, with hyperparameters in enough detail for a reader to reimplement it and train it to the same performance reported in the paper.

• If the architecture is not novel: a citation for the architecture/hyperparameters and a description of any differences in architecture, hyperparameters, or performance in this paper.

• A detailed description of the dataset hyperparameters (e.g., batch size and augmentation regime) in enough detail for a reader to reimplement it.

• A description of the library and hardware used.

For all results, check if:

• Data is presented across a range of compression ratios, including extreme compression ratios at which the accuracy of the pruned network declines substantially.

• Data specifies the raw accuracy of the network at each point.

• Data includes multiple runs with separate initializations and random seeds.

• Data includes clearly defined error bars and a measure of central tendency (e.g., mean) and variation (e.g., standard deviation).

• Data includes FLOP counts if the paper makes arguments about efficiency and performance due to pruning.
For all pruning results presented, check if there is a comparison to:

• A random pruning baseline.

  – A global random pruning baseline.

  – A random pruning baseline with the same layerwise pruning proportions as the proposed technique.

• A magnitude pruning baseline (a minimal sketch of such baselines follows this checklist).

  – A global or uniform layerwise proportion magnitude pruning baseline.

  – A magnitude pruning baseline with the same layerwise pruning proportions as the proposed technique.

• Other relevant state-of-the-art techniques, including:

  – A description of how the comparisons were produced (data taken from the paper, reimplementation, or reuse of code from the paper) and any differences or uncertainties between this setting and the setting used in the main experiments.
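As a concrete reference for the baselines named above, the sketch below builds unstructured global and uniform-layerwise magnitude-pruning masks for a PyTorch model. It is an illustration under our own simplifying assumptions (all parameters treated as prunable, ties at the threshold pruned), not ShrinkBench's implementation.

import torch

def global_magnitude_masks(model, fraction):
    """Keep the largest-|w| weights across the whole model; prune the rest."""
    scores = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = int(fraction * scores.numel())  # number of weights to prune
    threshold = torch.kthvalue(scores, k).values if k > 0 else scores.new_tensor(float("-inf"))
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters()}

def layerwise_magnitude_masks(model, fraction):
    """Prune (approximately) the same fraction of weights in every layer by |w|."""
    masks = {}
    for name, p in model.named_parameters():
        flat = p.detach().abs().flatten()
        k = int(fraction * flat.numel())
        threshold = torch.kthvalue(flat, k).values if k > 0 else flat.new_tensor(float("-inf"))
        masks[name] = (p.detach().abs() > threshold).float()
    return masks

A random pruning baseline with matched layerwise proportions can be obtained from the same template by replacing the |w| scores with scores drawn uniformly at random.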

C Experimental Setup

For reproducibility purposes, ShrinkBench fixes random seeds for all of its dependencies (PyTorch, NumPy, and Python).
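A minimal sketch of this kind of seed fixing follows; the helper function is illustrative and is not ShrinkBench's API.

import random

import numpy as np
import torch

def fix_seeds(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs so that runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no effect if CUDA is unavailable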

C.1 Pruning Methods

For the reported experiments, we did not prune the classifier layer preceding the softmax; ShrinkBench supports pruning this layer as an option for all of the pruning strategies considered. For both Global and Layerwise Gradient Magnitude Pruning, a single minibatch is used to compute the gradients used for pruning. Three independent runs with different random seeds were performed for every CIFAR-10 experiment. We found some variance across methods that rely on randomness, such as random pruning or gradient-based methods that use a sampled minibatch to compute the gradients with respect to the weights.
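To illustrate the single-minibatch gradient computation, the sketch below scores each weight by |w · ∂L/∂w| on one minibatch and thresholds the scores globally. This is one common formulation of gradient magnitude pruning, shown only as an illustration under our assumptions (a classification loss, unstructured pruning); it is not ShrinkBench's exact scoring code.

import torch
import torch.nn.functional as F

def global_gradient_magnitude_masks(model, inputs, targets, fraction):
    """Score each weight by |w * grad| from a single minibatch, then prune globally."""
    model.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss.backward()

    scores = {name: (p.detach() * p.grad.detach()).abs()
              for name, p in model.named_parameters() if p.grad is not None}
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = int(fraction * flat.numel())  # number of weights to prune
    threshold = torch.kthvalue(flat, k).values if k > 0 else flat.new_tensor(float("-inf"))
    return {name: (s > threshold).float() for name, s in scores.items()}

A layerwise variant thresholds the scores within each parameter tensor instead of across the whole model, and the resulting masks are applied by elementwise multiplication with the corresponding weights before finetuning.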

C.2 Finetuning Setup

Pruning was performed on the pretrained weights, and the resulting sparsity pattern was kept fixed from then on. Early stopping is used during finetuning: if the validation accuracy repeatedly decreases after some point, we stop finetuning to prevent overfitting.

All reported CIFAR-10 experiments used the following finetuning setup:

• Batch size: 64
• Epochs: 30
• Optimizer: Adam
• Initial learning rate: 3 × 10⁻⁴
• Learning rate schedule: fixed

All reported ImageNet experiments used the following finetuning setup:

• Batch size: 256
• Epochs: 20
• Optimizer: SGD with Nesterov momentum (0.9)
• Initial learning rate: 1 × 10⁻³
• Learning rate schedule: fixed
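A minimal sketch of a finetuning loop consistent with this setup, using the CIFAR-10 hyperparameters listed above, is shown below; the loop structure, patience value, and evaluate helper are illustrative assumptions rather than ShrinkBench's exact implementation.

import torch

def finetune(model, train_loader, val_loader, evaluate, epochs=30, lr=3e-4, patience=3):
    """Finetune a pruned model with Adam; stop early if validation accuracy keeps dropping."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # fixed learning rate schedule
    criterion = torch.nn.CrossEntropyLoss()
    best_acc, epochs_without_improvement = 0.0, 0

    for epoch in range(epochs):
        model.train()
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()

        val_acc = evaluate(model, val_loader)  # caller-supplied: returns validation accuracy
        if val_acc > best_acc:
            best_acc, epochs_without_improvement = val_acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping to prevent overfitting
    return model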
D Additional Results

Here we include the entire set of results obtained with ShrinkBench. For CIFAR-10, results are included for CIFAR-VGG, ResNet-20, ResNet-56, and ResNet-110. Standard deviations across three different random runs are plotted as error bars. For ImageNet, results are reported for ResNet-18.

[Each figure plots accuracy against compression ratio or theoretical speedup for Global Weight, Layer Weight, Global Gradient, and Layer Gradient pruning; the CIFAR-10 figures also include a Random pruning baseline.]

[Figure 9: Accuracy for several levels of compression for CIFAR-VGG on CIFAR-10.]

[Figure 10: Accuracy vs. theoretical speedup for CIFAR-VGG on CIFAR-10.]

[Figure 11: Accuracy for several levels of compression for ResNet-20 on CIFAR-10.]

[Figure 12: Accuracy vs. theoretical speedup for ResNet-20 on CIFAR-10.]

[Figure 13: Accuracy for several levels of compression for ResNet-56 on CIFAR-10.]

[Figure 14: Accuracy vs. theoretical speedup for ResNet-56 on CIFAR-10.]
[Figure 15: Accuracy for several levels of compression for ResNet-110 on CIFAR-10.]

[Figure 16: Accuracy vs. theoretical speedup for ResNet-110 on CIFAR-10.]

[Figure 17: Accuracy for several levels of compression for ResNet-18 on ImageNet.]

[Figure 18: Accuracy vs. theoretical speedup for ResNet-18 on ImageNet.]
