What is the State of Neural Network Pruning? (MLSys 2020)
Davis Blalock*1, Jose Javier Gonzalez Ortiz*1, Jonathan Frankle1, John Guttag1
ABSTRACT
Neural network pruning—the task of reducing the size of a network by removing parameters—has been the
subject of a great deal of work in recent years. We provide a meta-analysis of the literature, including an overview
of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers
and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers
from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to
compare pruning techniques to one another or determine how much progress the field has made over the past
three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and
introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods. We
use ShrinkBench to compare various pruning techniques and show that its comprehensive evaluation can prevent
common pitfalls when comparing pruning methods.
inference, different parameters may have different impacts. For instance, in convolutional layers, filters applied to spatially larger inputs are associated with more computation than those applied to smaller inputs.

Regardless of the goal, pruning imposes a tradeoff between model efficiency and quality, with pruning increasing the former while (typically) decreasing the latter. This means that a pruning method is best characterized not by a single model it has pruned, but by a family of models corresponding to different points on the efficiency-quality curve. To quantify efficiency, most papers report at least one of two metrics. The first is the number of multiply-adds (often referred to as FLOPs) required to perform inference with the pruned network. The second is the fraction of parameters pruned. To measure quality, nearly all papers report changes in Top-1 or Top-5 image classification accuracy.
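As a concrete illustration of the parameter-based metric (a sketch for a masked PyTorch model, not code taken from any of the surveyed papers), the fraction of parameters pruned, and the compression ratio derived from it, can be computed directly from the weight tensors:

import torch.nn as nn

def parameter_metrics(model: nn.Module) -> dict:
    # Count all weights and the weights that survived pruning (i.e., are nonzero).
    total = sum(p.numel() for p in model.parameters())
    nonzero = sum(int((p != 0).sum()) for p in model.parameters())
    return {
        "fraction_pruned": 1.0 - nonzero / total,      # fraction of parameters pruned
        "compression_ratio": total / max(nonzero, 1),  # original size / new size
    }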
As others have noted (Lebedev et al., 2014; Figurnov et al., 2016; Louizos et al., 2017; Yang et al., 2017; Han et al., 2015; Kim et al., 2015; Wen et al., 2016; Luo et al., 2017; He et al., 2018b), these metrics are far from perfect. Parameter and FLOP counts are a loose proxy for real-world latency, throughput, memory usage, and power consumption. Similarly, image classification is only one of the countless tasks to which neural networks have been applied. However, because the overwhelming majority of papers in our corpus focus on these metrics, our meta-analysis necessarily does as well.

3 LESSONS FROM THE LITERATURE

After aggregating results from a corpus of 81 papers, we identified a number of consistent findings. In this section, we provide an overview of our corpus and then discuss these findings.

3.1 Papers Used in Our Analysis

Our corpus consists of 79 pruning papers published since 2010 and two classic papers (LeCun et al., 1990; Hassibi et al., 1993) that have been compared to by a number of recent methods. We selected these papers by identifying popular papers in the literature and what cites them, systematically searching through conference proceedings, and tracing the directed graph of comparisons between pruning papers. This last procedure results in the property that, barring oversights on our part, there is no pruning paper in our corpus that compares to any pruning paper outside of our corpus. Additional details about our corpus and its construction can be found in Appendix A.

3.2 How Effective is Pruning?

One of the clearest findings about pruning is that it works. More precisely, there are various methods that can significantly compress models with little or no loss of accuracy. In fact, for small amounts of compression, pruning can sometimes increase accuracy (Han et al., 2015; Suzuki et al., 2018). This basic finding has been replicated in a large fraction of the papers in our corpus.

Along the same lines, it has been repeatedly shown that, at least for large amounts of pruning, many pruning methods outperform random pruning (Yu et al., 2018; Gale et al., 2019; Frankle et al., 2019; Mariet & Sra, 2015; Suau et al., 2018; He et al., 2017). Interestingly, this does not always hold for small amounts of pruning (Morcos et al., 2019). Similarly, pruning all layers uniformly tends to perform worse than intelligently allocating parameters to different layers (Gale et al., 2019; Han et al., 2015; Li et al., 2016; Molchanov et al., 2016; Luo et al., 2017) or pruning globally (Lee et al., 2019b; Frankle & Carbin, 2019). Lastly, when holding the number of fine-tuning iterations constant, many methods produce pruned models that outperform retraining from scratch with the same sparsity pattern (Zhang et al., 2015; Yu et al., 2018; Louizos et al., 2017; He et al., 2017; Luo et al., 2017; Frankle & Carbin, 2019), at least with a large enough amount of pruning (Suau et al., 2018). Retraining from scratch in this context means training a fresh, randomly-initialized model with all weights clamped to zero throughout training, except those that are nonzero in the pruned model.
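A minimal sketch of this control in PyTorch may make it concrete (the function name and hyperparameters below are illustrative assumptions, not taken from any particular paper; `masks` maps parameter names to {0,1} tensors derived from the pruned model):

import torch

def retrain_from_scratch(model, masks, loader, loss_fn, epochs=10, lr=0.1):
    # masks[name] is 1 where the pruned model kept the weight and 0 elsewhere.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    def clamp():
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])
    clamp()  # clamp the fresh random initialization to the sparsity pattern
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            clamp()  # keep pruned weights at exactly zero throughout training
    return model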
Another consistent finding is that sparse models tend to outperform dense ones for a fixed number of parameters. Lee et al. (2019a) show that increasing the nominal size of ResNet-20 on CIFAR-10 while sparsifying to hold the number of parameters constant decreases the error rate. Kalchbrenner et al. (2018) obtain a similar result for audio synthesis, as do Gray et al. (2017) for a variety of additional tasks across various domains. Perhaps most compelling of all are the many results, including in Figure 1, showing that pruned models can obtain higher accuracies than the original models from which they are derived. This demonstrates that sparse models can not only outperform dense counterparts with the same number of parameters, but sometimes dense models with even more parameters.

3.3 Pruning vs Architecture Changes

One current unknown about pruning is how effective it tends to be relative to simply using a more efficient architecture. These options are not mutually exclusive, but it may be useful in guiding one's research or development efforts to know which choice is likely to have the larger impact. Along similar lines, it is unclear how pruned models from different architectures compare to one another—i.e., to what extent does pruning offer similar benefits across architectures? To address these questions, we plotted the reported accuracies and compression/speedup levels of pruned models on ImageNet alongside the same metrics for different architectures with no pruning (Figure 1).
We plot results within a family of models as a single curve.

Figure 1: Size and speed vs accuracy tradeoffs for different pruning methods and families of architectures. Pruned models sometimes outperform the original architecture, but rarely outperform a better architecture. (Panels show speed and size tradeoffs for original and pruned models, plotted against number of parameters and number of FLOPs; architectures shown: MobileNet-v2 (2018), ResNet (2016), VGG (2014), EfficientNet (2019), and their pruned counterparts.)

4 MISSING CONTROLLED COMPARISONS

Much of the difficulty stems from a lack of experimental standardization and the resulting fragmentation in reported results. This fragmentation makes it difficult for even the most committed authors to compare to more than a few existing methods.

4.1 Omission of Comparison

Many papers claim to advance the state of the art, but don't compare to other methods—including many published ones—that make the same claim.

Ignoring Pre-2010s Methods. There was already a rich body of work on neural network pruning by the mid 1990s (see, e.g., Reed's survey (Reed, 1993)), which has been almost completely ignored except for LeCun's Optimal Brain Damage (LeCun et al., 1990) and Hassibi's Optimal Brain Surgeon (Hassibi et al., 1993). Indeed, multiple authors have rediscovered existing methods or aspects thereof, with Han et al. (2015) reintroducing the magnitude-based pruning of Janowsky (1989), Lee et al. (2019b) reintroducing the saliency heuristic of Mozer & Smolensky (1989a), and He et al. (2018a) reintroducing the practice of "reviving" previously pruned weights described in Tresp et al. (1997).

Ignoring Recent Methods. Even when considering only post-2010 approaches, there are still virtually no methods that have been shown to outperform all existing "state-of-the-art" methods. This follows from the fact, depicted in the top plot of Figure 2, that there are dozens of modern papers—including many affirmed through peer review—that have never been compared to by any later study.

A related problem is that papers tend to compare to few existing methods. In the lower plot of Figure 2, we see that more than a fourth of our corpus does not compare to any previously proposed pruning method, and another fourth compares to only one. Nearly all papers compare to three or fewer. This might be adequate if there were a clear progression of methods with one or two "best" methods at any given time, but this is not the case.

Figure 2: Reported comparisons between papers. (Top: for each paper, how many other papers compare to it; bottom: how many other papers each paper compares to. Peer-reviewed and other papers are shown separately.)

4.2 Dataset and Architecture Fragmentation

Many of the most common (dataset, architecture) combinations involve small datasets and models that may not be representative of modern, real-world networks. MNIST results may be particularly unlikely to generalize, since this dataset differs significantly from other popular datasets for image classification. In particular, its images are grayscale, composed mostly of zeros, and possible to classify with over 99% accuracy using simple models (LeCun et al., 1998b).

Table 1: All combinations of dataset and architecture used in at least 4 out of 81 papers.

(Dataset, Architecture) Pair      Number of Papers Using Pair
ImageNet, VGG-16                  22
ImageNet, ResNet-50               15
MNIST, LeNet-5-Caffe              14
CIFAR-10, ResNet-56               14
MNIST, LeNet-300-100              12
MNIST, LeNet-5                    11
ImageNet, CaffeNet                10
CIFAR-10, CIFAR-VGG (Torch)        8
ImageNet, AlexNet                  8
ImageNet, ResNet-18                6
ImageNet, ResNet-34                6
CIFAR-10, ResNet-110               5
CIFAR-10, PreResNet-164            4
CIFAR-10, ResNet-32                4

4.3 Metrics Fragmentation

As depicted in Figure 3, papers report a wide variety of metrics and operating points, making it difficult to compare results. Each column in this figure is one (dataset, architecture) combination taken from the four most common combinations, excluding results on MNIST. (We combined the results for AlexNet and CaffeNet, which is a slightly modified version of AlexNet (caf, 2016), since many authors refer to the latter as "AlexNet," and it is often unclear which model was used.) Each row is one pair of metrics. Each curve is the efficiency vs accuracy tradeoff obtained by one method. (Since what counts as one method can be unclear, we consider all results from one paper to be one method except when two or more named methods within the paper report using at least one identical x-coordinate, i.e., when the paper's results cannot be plotted as one curve.) Methods are color-coded by year.

It is hard to identify any consistent trends in these plots, aside from the existence of a tradeoff between efficiency and accuracy. A given method is only present in a small subset of plots. Methods from later years do not consistently outperform methods from earlier years. Methods within a plot are often incomparable because they report results at different points on the x-axis. Even when methods are nearby on the x-axis, it is not clear whether one meaningfully outperforms another since neither reports a standard deviation or other measure of central tendency. Finally, most papers in our corpus do not report any results with any of these common configurations.

4.4 Incomplete Characterization of Results

If all papers reported a wide range of points in their tradeoff curves across a large set of models and datasets, there might be some number of direct comparisons possible between any given pair of methods. As we see in the upper half of Figure 4, however, most papers use at most three (dataset, architecture) pairs; and as we see in the lower half, they use at most three—and often just one—point to characterize each curve. Combined with the fragmentation in experimental choices, this means that different methods' results are rarely directly comparable. Note that the lower half restricts results to the four most common (dataset, architecture) pairs.

4.5 Confounding Variables

Even when comparisons include the same datasets, models, metrics, and operating points, other confounding variables still make meaningful comparisons difficult. Some variables of particular interest include:

• Accuracy and efficiency of the initial model
• Data augmentation and preprocessing
• Random variations in initialization, training, and fine-tuning. This includes choice of optimizer, hyperparameters, and learning rate schedule.
• Pruning and fine-tuning schedule
• Deep learning library. Different libraries are known to yield different accuracies for the same architecture and dataset (Northcutt, 2019; Nola, 2016) and may have subtly different behaviors (Vryniotis, 2018).
Figure 3: Fragmentation of results. Shown are all self-reported results on the most common (dataset, architecture) combinations. Each column is one combination, each row shares an accuracy metric (y-axis), and pairs of rows share a compression metric (x-axis). Up and to the right is always better. Standard deviations are shown for He 2018 on CIFAR-10, which is the only result that provides any measure of central tendency. As suggested by the legend, only 37 out of the 81 papers in our corpus report any results using any of these configurations. (Rows show change in Top-1 or Top-5 accuracy; x-axes show log2 compression ratio or theoretical speedup. Methods shown: Collins 2014; Han 2015; Zhang 2015; Figurnov 2016; Guo 2016; Han 2016; Hu 2016; Kim 2016; Srinivas 2016; Wen 2016; Alvarez 2017; He 2017; He 2017, 3C; Li 2017; Lin 2017; Luo 2017; Srinivas 2017; Yang 2017; Carreira-Perpinan 2018; Ding 2018; Dubey 2018, AP+Coreset-A/K/S; He, Yang 2018 and Fine-Tune variant; He, Yihui 2018; Huang 2018; Lin 2018; Peng 2018; Suau 2018, PFA-En/PFA-KL; Suzuki 2018; Yamamoto 2018; Yu 2018; Zhuang 2018; Choi 2019; Gale 2019, Magnitude-v2; Kim 2019; Liu 2019, Scratch-B; Luo 2019; Peng 2019, CCP/CCP-AC.)

• Subtle differences in code and environment that may not be easily attributable to any of the above variations (Crall, 2018; Jogeshwar, 2017; unr, 2017).

In general, it is not clear that any paper can succeed in accounting for all of these confounders unless that paper has both used the same code as the methods to which it compares and reports enough measurements to average out random variations. This is exceptionally rare, with Gale et al. (2019) and Liu et al. (2019) being arguably the only examples. Moreover, neither of these papers introduce novel pruning methods per se but are instead inquiries into the efficacy of existing methods.

Many papers attempt to account for subsets of these confounding variables. A near universal practice in this regard is reporting change in accuracy relative to the original model, in addition to or instead of raw accuracy. This helps to control for the accuracy of the initial model. However, as we demonstrate in Section 7, this is not sufficient to remove initial model as a confounder. Certain initial models can be pruned more or less efficiently, in terms of the accuracy vs compression tradeoff. This holds true even with identical pruning methods and all other variables held constant.
Figure 4: Number of results reported by each paper, excluding MNIST. Top) Most papers report on three or fewer (dataset, architecture) pairs. Bottom) For each pair used, most papers characterize their tradeoff between amount of pruning and accuracy using a single point in the efficiency vs accuracy curve. In both plots, the pattern holds even for peer-reviewed papers.

Figure 5: Pruning ResNet-50 on ImageNet. Methods in the upper plot all prune weights with the smallest magnitudes, but differ in implementation, pruning schedule, and fine-tuning. The variation caused by these variables is similar to the variation across different pruning methods, whose results are shown in the lower plot. All results are taken from the original papers. (Methods shown: Frankle 2019, PruneAtEpoch=15/90 and ResetToEpoch=10/R; Gale 2019, Magnitude/Magnitude-v2/SparseVD; Liu 2019, Magnitude/Scratch-B; Alvarez 2017; Dubey 2018, AP+Coreset-A/K/S; Huang 2018; Lin 2018; Luo 2017; Yamamoto 2018; Zhuang 2018.)

There are at least two more empirical reasons to believe that confounding variables can have a significant impact. First, as one can observe in Figure 3, methods often introduce changes in accuracy of much less than 1% at reported operating points. This means that, even if confounders have only a tiny impact on accuracy, they can still have a large impact on which method appears better.

Second, as shown in Figure 5, existing results demonstrate that different training and fine-tuning settings can yield nearly as much variability as different methods. Specifically, consider 1) the variability introduced by different fine-tuning methods for unstructured magnitude-based pruning (Figure 5, top) and 2) the variability introduced by entirely different pruning methods (Figure 5, bottom). The variability between fine-tuning methods is nearly as large as the variability between pruning methods.

5 FURTHER BARRIERS TO COMPARISON

In the previous section, we discussed the fragmentation of datasets, models, metrics, operating points, and experimental details, and how this fragmentation makes evaluating the efficacy of individual pruning methods difficult. In this section, we argue that there are additional barriers to comparing methods that stem from common practices in how methods and results are presented.

5.1 Architecture Ambiguity

It is often difficult, or even impossible, to identify the exact architecture that authors used. Perhaps the most prevalent example of this is when authors report using some sort of ResNet (He et al., 2016a;b). Because there are two different variations of ResNets, introduced in these two papers, saying that one used a "ResNet-50" is insufficient to identify a particular architecture. Some authors do appear to deliberately point out the type of ResNet they use (e.g., (Liu et al., 2017; Dong et al., 2017)).
However, given that few papers even hint at the possibility of confusion, it seems unlikely that all authors are even aware of the ambiguity, let alone that they have cited the corresponding paper in all cases.

Perhaps the greatest confusion is over VGG networks (Simonyan & Zisserman, 2014). Many papers describe experimenting on "VGG-16," "VGG," or "VGGNet," suggesting a standard and well-known architecture. In many cases, what is actually used is a custom variation of some VGG model, with removed fully-connected layers (Changpinyo et al., 2017; Luo et al., 2017), smaller fully-connected layers (Lee et al., 2019b), or added dropout or batchnorm (Liu et al., 2017; Lee et al., 2019b; Peng et al., 2018; Molchanov et al., 2017; Ding et al., 2018; Suau et al., 2018).

In some cases, papers simply fail to make clear what model they used (even for non-VGG architectures). For example, one paper just states that their segmentation model "is composed from an inception-like network branch and a DenseNet network branch." Another paper attributes their VGGNet to (Parkhi et al., 2015), which mentions three VGG networks. Liu et al. (2019) and Frankle & Carbin (2019) have circular references to one another that can no longer be resolved because of simultaneous revisions. One paper mentions using a "VGG-S" from the Caffe Model Zoo, but as of this writing, no model with this name exists there. Perhaps the most confusing case is the Lenet-5-Caffe reported in one 2017 paper. The authors are to be commended for explicitly stating not only that they use Lenet-5-Caffe, but their exact architecture. However, they describe an architecture with an 800-unit fully-connected layer, while examination of both the Caffe .prototxt files (Jia et al., 2015a;b) and associated blog post (Jia et al., 2016) indicates that no such layer exists in Lenet-5-Caffe.

5.2 Metrics Ambiguity

It can also be difficult to know what the reported metrics mean. For example, many papers include a metric along the lines of "Pruned%". In some cases, this means the fraction of the parameters or FLOPs remaining (Suau et al., 2018). In other cases, it means the fraction of parameters or FLOPs removed (Han et al., 2015; Lebedev & Lempitsky, 2016; Yao et al., 2018). There is also widespread misuse of the term "compression ratio," which the compression literature has long used to mean original size divided by compressed size (Siedelmann et al., 2015; Zukowski et al., 2006; Zhao et al., 2015; Lindstrom, 2014; Ratanaworabhan et al., 2006; Blalock et al., 2018), but which many pruning authors define (usually without making the formula explicit) as one minus the ratio of compressed size to original size.
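Written out explicitly, the two incompatible usages are:

\text{compression ratio (compression literature)} = \frac{\text{original size}}{\text{compressed size}},
\qquad
\text{``compression ratio'' (many pruning papers)} = 1 - \frac{\text{compressed size}}{\text{original size}}.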
Reported "speedup" values present similar challenges. These values are sometimes wall time, sometimes original number of FLOPs divided by pruned number of FLOPs, sometimes a more complex formula relating these two quantities (Dong et al., 2017; He et al., 2018a), and sometimes never made clear. Even when reporting FLOPs, which is nominally a consistent metric, different authors measure it differently (e.g., (Molchanov et al., 2016) vs (Wang & Cheng, 2016)), though most often papers entirely omit their formula for computing FLOPs. We found up to a factor of four variation in the reported FLOPs of different papers for the same architecture and dataset, with (Yang et al., 2017) reporting 371 MFLOPs for AlexNet on ImageNet, (Choi et al., 2019) reporting 724 MFLOPs, and (Han et al., 2015) reporting 1500 MFLOPs.

6 SUMMARY AND RECOMMENDATIONS

In the previous sections, we have argued that existing work tends to

• make it difficult to identify the exact experimental setup and metrics,
• use too few (dataset, architecture) combinations,
• report too few points in the tradeoff curve for any given combination, and no measures of central tendency,
• omit comparison to many methods that might be state-of-the-art, and
• fail to control for confounding variables.

These problems often make it difficult or impossible to assess the relative efficacy of different pruning methods. To enable direct comparison between methods in the future, we suggest the following practices:

• Identify the exact sets of architectures, datasets, and metrics used, ideally in a structured way that is not scattered throughout the results section.
• Use at least three (dataset, architecture) pairs, including modern, large-scale ones. MNIST and toy models do not count. AlexNet, CaffeNet, and Lenet-5 are no longer modern architectures.
• For any given pruned model, report both compression ratio and theoretical speedup. Compression ratio is defined as the original size divided by the new size. Theoretical speedup is defined as the original number of multiply-adds divided by the new number. Note that there is no reason to report only one of these metrics (see the sketch at the end of this section).
• For ImageNet and other many-class datasets, report both Top-1 and Top-5 accuracy. There is again no reason to report only one of these.
• Whatever metrics one reports for a given pruned model, also report these metrics for an appropriate control (usually the original model before pruning).
• Plot the tradeoff curve for a given dataset and architecture, alongside the curves for competing methods.
• When plotting tradeoff curves, use at least 5 operating points spanning a range of compression ratios. The set of ratios {2, 4, 8, 16, 32} is a good choice.
• Report and plot means and sample standard deviations, instead of one-off measurements, whenever feasible.
• Ensure that all methods being compared use identical libraries, data loading, and other code to the greatest extent possible.

We also recommend that reviewers demand a much greater level of rigor when evaluating papers that claim to offer a better method of pruning neural networks.
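As a sketch of the theoretical-speedup recommendation above (ours, not a prescribed implementation; papers should always state whichever counting convention they actually use, including how zeroed weights are credited):

import torch
import torch.nn as nn

def count_macs(model, input_size=(3, 224, 224)):
    # Dense multiply-adds for Conv2d and Linear layers on one input, via forward hooks.
    # Theoretical speedup = count_macs(original) / count_macs(pruned). For unstructured
    # pruning, one would additionally scale each layer by its fraction of nonzero weights.
    macs = 0
    def conv_hook(module, inputs, output):
        nonlocal macs
        # Each output element requires (C_in / groups) * kH * kW multiply-adds.
        macs += (output.numel() * (module.in_channels // module.groups)
                 * module.kernel_size[0] * module.kernel_size[1])
    def linear_hook(module, inputs, output):
        nonlocal macs
        macs += output.shape[0] * module.in_features * module.out_features
    hooks = [m.register_forward_hook(conv_hook) for m in model.modules() if isinstance(m, nn.Conv2d)]
    hooks += [m.register_forward_hook(linear_hook) for m in model.modules() if isinstance(m, nn.Linear)]
    model.eval()
    with torch.no_grad():
        model(torch.zeros(1, *input_size))
    for h in hooks:
        h.remove()
    return macs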
7 SHRINKBENCH

7.1 Overview of ShrinkBench

To make it as easy as possible for researchers to put our suggestions into practice, we have created an open-source library for pruning called ShrinkBench. ShrinkBench provides standardized and extensible functionality for training, pruning, fine-tuning, computing metrics, and plotting, all using a standardized set of pretrained models and datasets. ShrinkBench is based on PyTorch (Paszke et al., 2017) and is designed to allow easy evaluation of methods with arbitrary scoring functions, allocation of pruning across layers, and sparsity structures. In particular, given a callback defining how to compute masks for a model's parameter tensors at a given iteration, ShrinkBench will automatically apply the pruning, update the network according to a standard training or fine-tuning setup, and compute metrics across many models, datasets, random seeds, and levels of pruning. We defer discussion of ShrinkBench's implementation and API to the project's documentation.
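As an illustration of this callback-based design (a hypothetical example written against plain PyTorch, not ShrinkBench's actual API), a mask-computing callback for gradually increasing global magnitude pruning might look like:

import torch

def magnitude_mask_callback(params, step, total_steps, final_sparsity=0.9):
    # params: dict mapping parameter names to tensors; returns a {0,1} mask per tensor.
    # Sparsity ramps linearly from 0 to final_sparsity over total_steps iterations.
    sparsity = final_sparsity * min(step / total_steps, 1.0)
    scores = torch.cat([p.detach().abs().flatten() for p in params.values()])
    k = int(sparsity * scores.numel())  # number of weights to remove at this iteration
    if k == 0:
        return {name: torch.ones_like(p) for name, p in params.items()}
    threshold = scores.kthvalue(k).values  # k-th smallest magnitude across all tensors
    return {name: (p.detach().abs() > threshold).float() for name, p in params.items()}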
7.2 Baselines

We used ShrinkBench to implement several existing pruning heuristics, both as examples of how to use our library and as baselines that new methods can compare to: magnitude-based pruning of weights, applied globally or per layer; gradient-based pruning, applied globally or per layer; and random pruning (these appear in our figures as Global Weight, Layer Weight, Global Gradient, Layer Gradient, and Random). Magnitude-based approaches are common baselines in the literature and remain competitive with more complex methods (Han et al., 2015; 2016; Gale et al., 2019; Frankle et al., 2019). Gradient-based methods are less common, but are simple to implement and have recently gained popularity (Lee et al., 2019b;a; Yu et al., 2018). Random pruning is a common straw man that can serve as a useful debugging tool. Note that these baselines are not reproductions of any of these methods, but merely inspired by their pruning heuristics.
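To make the distinctions concrete, the following sketch (ours, not the ShrinkBench implementations) contrasts the two scoring rules and the two ways of allocating the pruning budget:

import torch

def importance_scores(params, grads, rule="magnitude"):
    # |w| for magnitude-based scoring, |w * dL/dw| for gradient-based scoring.
    if rule == "magnitude":
        return {n: p.detach().abs() for n, p in params.items()}
    return {n: (p.detach() * grads[n]).abs() for n, p in params.items()}

def prune_masks(scores, fraction_to_keep, allocation="global"):
    # Layerwise: keep the top fraction within every tensor independently.
    if allocation == "layerwise":
        masks = {}
        for n, s in scores.items():
            k = max(int(fraction_to_keep * s.numel()), 1)
            thresh = s.flatten().topk(k).values.min()
            masks[n] = (s >= thresh).float()
        return masks
    # Global: pool scores across all tensors and use a single threshold.
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(int(fraction_to_keep * flat.numel()), 1)
    thresh = flat.topk(k).values.min()
    return {n: (s >= thresh).float() for n, s in scores.items()}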
7.3 Avoiding Pruning Pitfalls with ShrinkBench

Using the described baselines, we pruned over 800 networks with varying datasets, networks, compression ratios, initial weights and random seeds. In doing so, we identified various pitfalls associated with experimental practices that are currently common in the literature but are avoided by using ShrinkBench.

We highlight several noteworthy results below. For additional experimental results and details, see Appendix D. One-standard-deviation bars across three runs are shown for all CIFAR-10 results.

Metrics are not Interchangeable. As discussed previously, it is common practice to report either the reduction in the number of parameters or in the number of FLOPs. If these metrics were extremely correlated, reporting only one would be sufficient to characterize the efficacy of a pruning method. After computing these metrics for the same models under many different settings, we found that reporting one metric is not sufficient. While these metrics are correlated, the correlation is different for each pruning method. Thus, the relative performance of different methods can vary significantly under different metrics (Figure 6).

(Figure 6: ResNet-18 on ImageNet.)

If the relative performance of different methods were constant across datasets and architectures, evaluating on only a few settings would not be problematic. However, our results suggest that this performance is not constant. Figure 7 shows the accuracy for various compression ratios for CIFAR-VGG (Zagoruyko, 2015) and ResNet-56 on CIFAR-10. In general, Global methods are more accurate than Layerwise methods and Magnitude-based methods are more accurate than Gradient-based methods, with random performing worst of all. However, if one were to look only at CIFAR-VGG for compression ratios smaller than 10, one could conclude that Global Gradient outperforms all other methods. Similarly, while Global Gradient consistently outperforms Layerwise Magnitude on CIFAR-VGG, the opposite holds on ResNet-56 (i.e., the orange and green lines switch places).

Moreover, we found that for some settings close to the drop-off point (such as Global Gradient, compression 16), different random seeds yielded significantly different results (0.88 vs 0.61 accuracy) due to the randomness in minibatch selection. This is illustrated by the large vertical error bar in the left subplot.

(Figure 7: Accuracy at several compression ratios for CIFAR-VGG and ResNet-56 on CIFAR-10.)

Figure 8: Global and Layerwise Magnitude Pruning on two different ResNet-56 models. Even with all other variables held constant, different initial models yield different tradeoff curves. This may cause one method to erroneously appear better than another. Controlling for initial accuracy does not fix this.

We also found that the common practice of examining changes in accuracy is insufficient to correct for the initial model as a confounder. Even when reporting changes, one pruning method can artificially appear better than another by virtue of beginning with a different model. We see this on the right side of Figure 8, where Layerwise Magnitude with Weights B appears to outperform Global Magnitude with Weights A, even though the former never outperforms the latter when the initial model is held fixed.
REFERENCES

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018b.

Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

Huang, Z. and Wang, N. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 304–320, 2018.

Janowsky, S. A. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600–6603, June 1989. doi: 10.1103/PhysRevA.39.6600. URL https://link.aps.org/doi/10.1103/PhysRevA.39.6600.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. lenet. https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet.prototxt, 2 2015a.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. lenet-train-test. https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt, 2 2015b.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Training LeNet on MNIST with Caffe. https://caffe.berkeleyvision.org/gathered/examples/mnist.html, 5 2016. Accessed: 2019-07-22.

Jogeshwar, A. Validating resnet50. https://github.com/keras-team/keras/issues/8672, 12 2017. Accessed: 2019-07-22.

Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A. v. d., Dieleman, S., and Kavukcuoglu, K. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.

Karnin, E. D. A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2):239–242, 1990.

Kim, Y.-D., Park, E., Yoo, S., Choi, T., Yang, L., and Shin, D. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.

Lebedev, V. and Lempitsky, V. Fast ConvNets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564, 2016.

Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., and Lempitsky, V. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.

LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605, 1990.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.

LeCun, Y., Cortes, C., and Burges, C. The MNIST database of handwritten digits, 1998b. Accessed: 2019-09-06.

Lee, N., Ajanthan, T., Gould, S., and Torr, P. H. S. A signal propagation perspective for pruning neural networks at initialization. arXiv preprint arXiv:1906.06307, 2019a. URL http://arxiv.org/abs/1906.06307.

Lee, N., Ajanthan, T., and Torr, P. H. S. SNIP: Single-shot network pruning based on connection sensitivity. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019b. URL https://openreview.net/forum?id=B1VZqjAcYX.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710, 2016.

Lindstrom, P. Fixed-rate compressed floating-point arrays. IEEE Transactions on Visualization and Computer Graphics, 20(12):2674–2683, 2014.

Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744, 2017.

Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJlnB3C5Ym.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3288–3298, 2017.

Luo, J.-H., Wu, J., and Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066, 2017.

Mariet, Z. and Sra, S. Diversity networks: Neural network compression using determinantal point processes. arXiv preprint arXiv:1511.05077, 2015.

Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2498–2507. JMLR.org, 2017.

Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.

Morcos, A. S., Yu, H., Paganini, M., and Tian, Y. One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers. arXiv preprint arXiv:1906.02773, 2019. URL http://arxiv.org/abs/1906.02773.

Mozer, M. C. and Smolensky, P. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pp. 107–115, 1989a.

Mozer, M. C. and Smolensky, P. Using relevance to reduce network size automatically. Connection Science, 1(1):3–16, January 1989b. doi: 10.1080/09540098908915626. URL https://www.tandfonline.com/doi/full/10.1080/09540098908915626.

Nola, D. Keras doesn't reproduce caffe example code accuracy. https://github.com/keras-team/keras/issues/4444, 11 2016. Accessed: 2019-07-22.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Peng, B., Tan, W., Li, Z., Zhang, S., Xie, D., and Pu, S. Extreme network compression via filter group approximation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–316, 2018.

Ratanaworabhan, P., Ke, J., and Burtscher, M. Fast lossless compression of scientific floating-point data. In Data Compression Conference (DCC'06), pp. 133–142. IEEE, 2006.

Reed, R. Pruning algorithms-a survey. IEEE Transactions on Neural Networks, 4(5):740–747, September 1993. doi: 10.1109/72.248452. URL http://ieeexplore.ieee.org/document/248452/.

Siedelmann, H., Wender, A., and Fuchs, M. High speed lossless image compression. In German Conference on Pattern Recognition, pp. 343–355. Springer, 2015.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Suau, X., Zappella, L., and Apostoloff, N. Network compression using correlation analysis of layer responses. 2018.

Suzuki, T., Abe, H., Murata, T., Horiuchi, S., Ito, K., Wachi, T., Hirai, S., Yukishima, M., and Nishimura, T. Spectral-pruning: Compressing deep neural network via spectral analysis. arXiv preprint arXiv:1808.08558, 2018.

Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039, 2017.

Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

Tresp, V., Neuneier, R., and Zimmermann, H.-G. Early brain damage. In Advances in Neural Information Processing Systems, pp. 669–675, 1997.

Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.

Yamamoto, K. and Maeno, K. PCAS: Pruning channels with attention statistics. arXiv preprint arXiv:1806.05382, 2018.

Zhang, X., Zou, J., He, K., and Sun, J. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2015.

Zhao, W. X., Zhang, X., Lemire, D., Shan, D., Nie, J.-Y., Yan, H., and Wen, J.-R. A general SIMD-based approach to accelerating compression algorithms. ACM Transactions on Information Systems (TOIS), 33(3):15, 2015.

Zukowski, M., Heman, S., Nes, N., and Boncz, P. Super-scalar RAM-CPU cache compression. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pp. 59–59. IEEE, 2006.
C EXPERIMENTAL SETUP
For reproducibility purposes, ShrinkBench fixes random seeds for all the dependencies (PyTorch, NumPy, Python).
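A sketch of this kind of seeding (the exact calls inside ShrinkBench may differ):

import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy
    torch.manual_seed(seed)           # PyTorch CPU and default CUDA RNGs
    torch.cuda.manual_seed_all(seed)  # all CUDA devices explicitly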
Figure 9: Accuracy for several levels of compression for CIFAR-VGG on CIFAR-10.

Figure 10: Accuracy vs theoretical speedup for CIFAR-VGG on CIFAR-10.

Figure 15: Accuracy for several levels of compression for ResNet-110 on CIFAR-10.

Figure 16: Accuracy vs theoretical speedup for ResNet-110 on CIFAR-10.

Figure 17: Accuracy for several levels of compression for ResNet-18 on ImageNet.

Figure 18: Accuracy vs theoretical speedup for ResNet-18 on ImageNet.

(Each plot compares Global Weight, Layer Weight, Global Gradient, Layer Gradient, and Random pruning.)