
Pruning and Quantization for Deep Neural Network Acceleration: A Survey

Tailin Liang a,b, John Glossner a,b,c, Lei Wang a, Shaobo Shi a,b and Xiaotong Zhang a,*

a School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
b Hua Xia General Processor Technologies, Beijing 100080, China
c General Processor Technologies, Tarrytown, NY 10591, United States

ARTICLE INFO

Keywords: convolutional neural network, neural network acceleration, neural network quantization, neural network pruning, low-bit mathematics

arXiv:2101.09671v3 [cs.CV] 15 Jun 2021

ABSTRACT

Deep neural networks have been applied in many applications exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy; in some cases accuracy may even improve. This paper provides a survey on two types of network compression: pruning and quantization. Pruning can be categorized as static if it is performed offline or dynamic if it is performed at run-time. We compare pruning techniques and describe criteria used to remove redundant computations. We discuss trade-offs in element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise pruning. Quantization reduces computations by reducing the precision of the datatype. Weights, biases, and activations may be quantized, typically to 8-bit integers, although lower bit-width implementations are also discussed, including binary neural networks. Both pruning and quantization can be used independently or combined. We compare current techniques, analyze their strengths and weaknesses, present compressed network accuracy results on a number of frameworks, and provide practical guidance for compressing networks.

1. Introduction

Deep Neural Networks (DNNs) have shown extraordinary abilities in complicated applications such as image classification, object detection, voice synthesis, and semantic segmentation [138]. Recent neural network designs with billions of parameters have demonstrated human-level capabilities but at the cost of significant computational complexity. DNNs with many parameters are also time-consuming to train [26]. These large networks are also difficult to deploy in embedded environments. Bandwidth becomes a limiting factor when moving weights and data between Compute Units (CUs) and memory. Over-parameterization is the property of a neural network where redundant neurons do not improve the accuracy of results. This redundancy can often be removed with little or no accuracy loss [225].

Figure 1 shows three design considerations that may contribute to over-parameterization: 1) network structure, 2) network optimization, and 3) hardware accelerator design. These design considerations are specific to Convolutional Neural Networks (CNNs) but are also generally relevant to DNNs.

Network structure encompasses three parts: 1) novel components, 2) network architecture search, and 3) knowledge distillation. Novel components are efficient block designs such as separable convolutions, inception blocks, and residual blocks. They are discussed in Section 2.4. Network components also encompass the types of connections within layers. Fully connected deep neural networks require N^2 connections between neurons. Feed-forward layers reduce connections by considering only connections in the forward path. This reduces the number of connections to N. Other types of components such as dropout layers can reduce the number of connections even further.

Network Architecture Search (NAS) [63], also known as network auto search, programmatically searches for a highly efficient network structure from a large predefined search space. An estimator is applied to each produced architecture. While time-consuming to compute, the final architecture often outperforms manually designed networks.

Knowledge Distillation (KD) [80, 206] evolved from knowledge transfer [27]. The goal is to generate a simpler compressed model that functions as well as a larger model. KD trains a student network that tries to imitate a teacher network. The student network is usually, but not always, smaller and shallower than the teacher. The trained student model should be less computationally complex than the teacher.

Network optimization [137] includes: 1) computational convolution optimization, 2) parameter factorization, 3) network pruning, and 4) network quantization. Convolution operations are more efficient than fully connected computations because they keep high dimensional information as a 3D tensor rather than flattening the tensors into vectors, which removes the original spatial information. This feature helps CNNs fit the underlying structure of image data in particular. Convolution layers also require significantly fewer coefficients compared to Fully Connected Layers (FCLs). Computational convolution optimizations include Fast Fourier Transform (FFT) based convolution [168], Winograd convolution [135], and the popular image to column (im2col) [34] approach. We discuss im2col in detail in Section 2.3 since it is directly related to general pruning techniques.

∗ Corresponding author
[email protected] (T. Liang); [email protected] (J. Glossner); [email protected] (L. Wang); [email protected] (S. Shi); [email protected] (X. Zhang)
ORCID(s): 0000-0002-7643-912X (T. Liang)


Figure 1: CNN Acceleration Approaches [40, 39, 142, 137, 194, 263, 182]: network structure (novel components, Network Architecture Search [63], Knowledge Distillation [80, 206]), network optimization (convolution optimization, factorization, pruning [201, 24, 12, 250], quantization [131, 87]), and hardware accelerators [151, 202] (CPU, GPU, ASIC, FPGA [86, 3, 234, 152]) with platform optimizations such as lookup tables, computation reuse, and memory optimization. Following the flow from design to implementation, CNN acceleration falls into three categories: structure design (or generation), further optimization, and specialized hardware.

Parameter factorization is a technique that decomposes higher-rank tensors into lower-rank tensors, simplifying memory access and compressing model size. It works by breaking large layers into many smaller ones, thereby reducing the number of computations. It can be applied to both convolutional and fully connected layers. This technique can also be applied with pruning and quantization.

Network pruning [201, 24, 12, 250] involves removing parameters that don't impact network accuracy. Pruning can be performed in many ways and is described extensively in Section 3.

Network quantization [131, 87] involves replacing datatypes with reduced-width datatypes, for example replacing 32-bit Floating Point (FP32) with 8-bit Integer (INT8). The values can often be encoded to preserve more information than simple conversion. Quantization is described extensively in Section 4.

Hardware accelerators [151, 202] are designed primarily for network acceleration. At a high level they encompass entire processor platforms and often include hardware optimized for neural networks. Processor platforms include specialized Central Processing Unit (CPU) instructions, Graphics Processing Units (GPUs), Application Specific Integrated Circuits (ASICs), and Field Programmable Gate Arrays (FPGAs).

CPUs have been optimized with specialized Artificial Intelligence (AI) instructions, usually within specialized Single Instruction Multiple Data (SIMD) units [49, 11]. While CPUs can be used for training, they have primarily been used for inference in systems that do not have specialized inference accelerators.

GPUs have been used for both training and inference. nVidia has specialized tensor units incorporated into their GPUs that are optimized for neural network acceleration [186]. AMD [7], ARM [10], and Imagination [117] also have GPUs with instructions for neural network acceleration.

Specialized ASICs have also been designed for neural network acceleration. They typically target inference at the edge, in security cameras, or on mobile devices. Examples include General Processor Technologies (GPT) [179], ARM, nVidia, and 60+ others [202], all of which have processors targeting this space. ASICs may also target both training and inference in datacenters, for example Tensor Processing Units (TPUs) from Google [125], Habana from Intel [169], Kunlun from Baidu [191], Hanguang from Alibaba [124], and the Intelligence Processing Unit (IPU) from Graphcore [121].

Programmable reconfigurable FPGAs have been used for neural network acceleration [86, 3, 234, 152]. FPGAs are widely used by researchers due to long ASIC design cycles. Neural network libraries are available from Xilinx [128] and Intel [69]. Specific neural network accelerators are also being integrated into FPGA fabrics [248, 4, 203]. Because FPGAs operate at the gate level, they are often used in low bit-width and binary neural networks [178, 267, 197].

Neural network specific optimizations are typically incorporated into custom ASIC hardware. Lookup tables can be used to accelerate trigonometric activation functions [46] or directly generate results for low bit-width arithmetic [65], partial products can be stored in special registers and reused [38], and memory access ordering with specialized addressing hardware can reduce the number of cycles to compute a neural network output [126]. Hardware accelerators are not the primary focus of this paper. However, we do note hardware implementations that incorporate specific acceleration techniques. Further background information on efficient processing and hardware implementations of DNNs can be found in [225].

We summarize our main contributions as follows:

• We provide a review of two network compression techniques: pruning and quantization. We discuss methods of compression, mathematical formulations, and compare current State-Of-The-Art (SOTA) compression methods.

• We classify pruning techniques into static and dynamic methods, depending if they are done offline or at run-time, respectively.

• We analyze and quantitatively compare quantization techniques and frameworks.

• We provide practical guidance on quantization and pruning.


This paper focuses primarily on network optimization for convolutional neural networks. It is organized as follows: In Section 2 we give an introduction to neural networks and specifically convolutional neural networks. We also describe some of the network optimizations of convolutions. In Section 3 we describe both static and dynamic pruning techniques. In Section 4 we discuss quantization and its effect on accuracy. We also compare quantization libraries and frameworks. We then present quantized accuracy results for a number of common networks. We present conclusions and provide guidance on appropriate application use in Section 5. Finally, we present concluding comments in Section 7.

2. Convolutional Neural Network

Convolutional neural networks are a class of feed-forward DNNs that use convolution operations to extract features from a data source. CNNs have been most successfully applied to visual-related tasks; however, they have found use in natural language processing [95], speech recognition [2], recommendation systems [214], malware detection [223], and industrial sensor time series prediction [261]. To provide a better understanding of optimization techniques, in this section we introduce the two phases of CNN deployment - training and inference, discuss types of convolution operations, describe Batch Normalization (BN) as an acceleration technique for training, describe pooling as a technique to reduce complexity, and describe the exponential growth in parameters deployed in modern network structures.

2.1. Definitions

This section summarizes terms and definitions used to describe neural networks as well as acronyms collected in Table 1.

• Coefficient - A constant by which an algebraic term is multiplied. Typically, a coefficient is multiplied by the data in a CNN filter.

• Parameter - All the factors of a layer, including coefficients and biases.

• Hyperparameter - A parameter predefined before network training or fine-tuning (re-training).

• Activation (𝐀 ∈ ℝ^{h×w×c}) - The activated (e.g., ReLU, Leaky, Tanh, etc.) output of one layer in a multi-layer network architecture, typically of height h, width w, and channel c. The h × w matrix is sometimes called an activation map. We also denote activation as output (𝐎) when the activation function does not matter.

• Feature (𝐅 ∈ ℝ^{h×w×c}) - The input data of one layer, to distinguish it from the output 𝐀. Generally the feature for the current layer is the activation of the previous layer.

• Kernel (𝐤 ∈ ℝ^{k1×k2}) - Convolutional coefficients for a channel, excluding biases. Typically they are square (e.g. k1 = k2) and sized 1, 3, or 7.

• Filter (𝐰 ∈ ℝ^{k1×k2×c×n}) - Comprises all of the kernels corresponding to the c channels of input features. The number of filters, n, determines the number of output channels.

• Weights - Two common uses: 1) kernel coefficients when describing part of a network, and 2) all the trained parameters in a neural network model when discussing the entire network.

2.2. Training and Inference

CNNs are deployed as a two-step process: 1) training and 2) inference. Training is performed first with the result being either a continuous numerical value (regression) or a discrete class label (classification). Classification training involves applying a given annotated dataset as an input to the CNN, propagating it through the network, and comparing the output classification to the ground-truth label. The network weights are then updated, typically using a backpropagation strategy such as Stochastic Gradient Descent (SGD), to reduce classification errors. This performs a search for the best weight values. Backpropagation is performed iteratively until a minimum acceptable error is reached or no further reduction in error is achieved. Backpropagation is compute intensive and traditionally performed in data centers that take advantage of dedicated GPUs or specialized training accelerators such as TPUs.

Fine-tuning is defined as retraining a previously trained model. It is easier to recover the accuracy of a quantized or pruned model with fine-tuning than by training from scratch.

CNN inference classification takes a previously trained classification model and predicts the class from input data not in the training dataset. Inference is not as computationally intensive as training and can be executed on edge, mobile, and embedded devices. The size of the inference network executing on mobile devices may be limited due to memory, bandwidth, or processing constraints [79]. Pruning, discussed in Section 3, and quantization, discussed in Section 4, are two techniques that can alleviate these constraints.

In this paper, we focus on the acceleration of CNN inference classification. We compare techniques using standard benchmarks such as ImageNet [122], CIFAR [132], and MNIST [139]. The compression techniques are general, and the choice of application domain doesn't restrict their use in object detection, natural language processing, etc.

2.3. Convolution Operations

The top of Figure 2 shows a 3-channel image (e.g., RGB) as input to a convolutional layer. Because the input image has 3 channels, the convolution kernel must also have 3 channels. In this figure four 2 × 2 × 3 convolution filters are shown, each consisting of three 2 × 2 kernels. Data is received from all 3 channels simultaneously. 12 image values are multiplied with the kernel weights producing a single output. The kernel is moved across the 3-channel image sharing the 12 weights.
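As a concrete illustration of the sliding-kernel computation just described, the following NumPy sketch convolves a 12 × 12 × 3 input with four 2 × 2 × 3 filters (stride 1, no padding). The function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def conv2d_single_filter(image, kernel, bias=0.0):
    """Direct convolution of one multi-channel filter over a multi-channel image.

    image : (H, W, C) input feature map
    kernel: (kH, kW, C) filter whose 12 weights are shared at every spatial position
    """
    H, W, C = image.shape
    kH, kW, _ = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW, :]       # kH * kW * C input values
            out[i, j] = np.sum(patch * kernel) + bias  # one multiply-accumulate per value
    return out

image = np.random.rand(12, 12, 3)                      # e.g. a 12x12 RGB input
filters = np.random.rand(4, 2, 2, 3)                   # four 2x2x3 filters
feature_maps = np.stack([conv2d_single_filter(image, f) for f in filters], axis=-1)
print(feature_maps.shape)                              # (11, 11, 4)
```

Each filter produces one 11 × 11 feature map, matching the output size discussed in the next paragraph.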


Table 1
Acronyms and Abbreviations

Acronym Explanation
2D Two Dimensional
3D Three Dimensional
FP16 16-Bit Floating-Point
FP32 32-Bit Floating-Point
INT16 16-Bit Integer
INT8 8-Bit Integer
IR Intermediate Representation
OFA One-For-All
RGB Red, Green, And Blue
SOTA State Of The Art
AI Artificial Intelligence
BN Batch Normalization
CBN Conditional Batch Normalization
CNN Convolutional Neural Network
DNN Deep Neural Network
EBP Expectation Back Propagation
FCL Fully Connected Layer
FCN Fully Connected Networks
FLOP Floating-Point Operation
GAP Global Average Pooling
GEMM General Matrix Multiply
GFLOP Giga Floating-Point Operation
ILSVRC ImageNet Large Scale Visual Recognition Challenge
Im2col Image To Column
KD Knowledge Distillation
LRN Local Response Normalization
LSTM Long Short Term Memory
MAC Multiply Accumulate
NAS Network Architecture Search
NN Neural Network
PTQ Post Training Quantization
QAT Quantization Aware Training
ReLU Rectified Linear Unit
RL Reinforcement Learning
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
STE Straight-Through Estimator
ASIC Application Specific Integrated Circuit
AVX-512 Advanced Vector Extension 512
CPU Central Processing Unit
CU Computing Unit
FPGA Field Programmable Gate Array
GPU Graphics Processing Unit
HSA Heterogeneous System Architecture
ISA Instruction Set Architecture
PE Processing Element
SIMD Single Instruction Multiple Data
SoC System on Chip
DPP Determinantal Point Process
FFT Fast Fourier Transform
FMA Fused Multiply-Add
KL-divergence Kullback-Leibler Divergence
LASSO Least Absolute Shrinkage And Selection Operator
MDP Markov Decision Process
OLS Ordinary Least Squares

Figure 2: Separable Convolution: A standard convolution is decomposed into a depth-wise convolution and a point-wise convolution to reduce both the model size and computations.

If the input image is 12 × 12 × 3, the resulting output will be 11 × 11 × 1 (using a stride of 1 and no padding). The filters work by extracting multiple smaller bit maps known as feature maps. If more filters are desired to learn different features they can be easily added. In this case 4 filters are shown resulting in 4 feature maps.

The standard convolution operation can be computed in parallel using a GEneral Matrix Multiply (GEMM) library [60]. Figure 3 shows a parallel column approach. The 3D tensors are first flattened into 2D matrices. The resulting matrices are multiplied by the convolutional kernel, which takes each input neuron (feature), multiplies it, and generates output neurons (activations) for the next layer [138].

Figure 3: Convolution Performance Optimization: From traditional convolution (dot squared) to the image to column (im2col) - GEMM approach, adopted from [34]. The red and green boxes indicate filter-wise and shape-wise elements, respectively.

\mathbf{F}^{l+1}_{n} = \mathbf{A}^{l}_{n} = \mathrm{activate}\Big\{ \sum_{m=1}^{M} \mathbf{W}^{l}_{mn} * \mathbf{F}^{l}_{m} + \mathbf{b}^{l}_{n} \Big\}    (1)

Equation 1 shows the layer-wise mathematical representation of the convolution layer, where 𝐖 represents the weights (filters) of the tensor with m input channels and n output channels, 𝐛 represents the bias vector, and 𝐅^l represents the input feature tensor (typically from the activation of the previous layer, 𝐀^{l-1}). 𝐀^l is the activated convolutional output. The goal of compression is to reduce the size of 𝐖 and 𝐅 (or 𝐀) without affecting accuracy.
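The im2col lowering of Figure 3 can be sketched as follows. This is an illustrative NumPy version, not the implementation from [34]: each kernel-sized patch is flattened into a matrix column so that the whole layer becomes a single GEMM.

```python
import numpy as np

def im2col(feature, kh, kw):
    """Flatten every kh x kw x C patch of a (H, W, C) tensor into one matrix column."""
    H, W, C = feature.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw * C, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = feature[i:i + kh, j:j + kw, :].reshape(-1)
    return cols

def conv_gemm(feature, weights, bias):
    """weights: (N, kh, kw, C) filters, bias: (N,). Returns an (oh, ow, N) activation tensor."""
    N, kh, kw, C = weights.shape
    cols = im2col(feature, kh, kw)          # (kh*kw*C, oh*ow)
    W = weights.reshape(N, -1)              # each row is one flattened filter
    out = W @ cols + bias[:, None]          # the whole layer as a single GEMM
    oh, ow = feature.shape[0] - kh + 1, feature.shape[1] - kw + 1
    return out.reshape(N, oh, ow).transpose(1, 2, 0)

x = np.random.rand(12, 12, 3)
w = np.random.rand(4, 2, 2, 3)
b = np.zeros(4)
print(conv_gemm(x, w, b).shape)             # (11, 11, 4)
```

For the 12 × 12 × 3 example this produces the same 11 × 11 × 4 output as the direct loop, while exposing the filter-wise (rows) and shape-wise (columns) structure that later pruning sections exploit.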


Figure 4: Fully Connected Layer: Each node in a layer connects to all the nodes in the next layer, and every line corresponds to a weight value.

Figure 4 shows an FCL - also called a dense layer or dense connect. Every neuron is connected to every other neuron in a crossbar configuration requiring many weights. As an example, if the input and output channels are 1024 and 1000, respectively, the number of parameters in the filter will be about a million: 1024 × 1000. As the image size grows or the number of features increases, the number of weights grows rapidly.

2.4. Efficient Structure

The bottom of Figure 2 shows separable convolution as implemented in MobileNet [105]. Separable convolution assembles a depth-wise convolution followed by a point-wise convolution. A depth-wise convolution groups the input feature by channel and treats each channel as a single input tensor, generating activations with the same number of channels. Point-wise convolution is a standard convolution with 1 × 1 kernels. It extracts mutual information across the channels with minimum computation overhead. For the 12 × 12 × 3 image previously discussed, a standard convolution needs 2 × 2 × 3 × 4 multiplies to generate 1 × 1 outputs. Separable convolution needs only 2 × 2 × 3 for the depth-wise convolution and 1 × 1 × 3 × 4 for the point-wise convolution. This reduces computations by half from 48 to 24. The number of weights is also reduced from 48 to 24.

Figure 5: Inception Block: The inception block computes multiple convolutions with one input tensor in parallel, which extends the receptive field by mixing the sizes of kernels. The yellow-brown coloured cubes are convolutional kernels sized 1, 3, and 5. The blue cube corresponds to a 3 × 3 pooling operation.

The receptive field is the size of a feature map used in a convolutional kernel. To extract data with a large receptive field and high precision, cascaded layers should be applied as in the top of Figure 5. However, the number of computations can be reduced by expanding the network width with four types of filters as shown in Figure 5. The concatenated result performs better than one convolutional layer with the same computation workload [226].

Figure 6: Conventional Network Block (top), Residual Network Block (middle), and Densely Connected Network Block (bottom).

A residual network architecture block [98] is a feed-forward layer with a short circuit between layers, as shown in the middle of Figure 6. The short circuit keeps information from the previous block to increase accuracy and avoid vanishing gradients during training. Residual networks help deep networks grow in depth by directly transferring information between deeper and shallower layers.

The bottom of Figure 6 shows the densely connected convolutional block from DenseNets [109]. This block extends both the network depth and the receptive field by delivering the features of former layers to all the later layers in a dense block using concatenation. ResNets transfer outputs from a single previous layer. DenseNets build connections across layers to fully utilize previous features. This provides weight efficiencies.

2.5. Batch Normalization

BN was introduced in 2015 to speed up the training phase and to improve neural network performance [119]. Most SOTA neural networks apply BN after a convolutional layer. BN addresses internal covariate shift (an altering of the network activation distribution caused by modifications to parameters during training) by normalizing layer inputs. This has been shown to reduce training time by up to 14×. Santurkar [210] argues that the efficiency of BN comes from its ability to smooth values during optimization.

\mathbf{y} = \gamma \cdot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta    (2)

Equation 2 gives the formula for computing inference BN, where 𝐱 and 𝐲 are the input feature and the output of BN, 𝛾 and 𝛽 are learned parameters, 𝜇 and 𝜎 are the mean value and standard deviation calculated from the training set, and 𝜖 is an additional small value (e.g., 1e-6) to prevent the denominator from being 0. The variables of Equation 2 are determined in the training pass and integrated into the trained weights.


If the features in one channel share the same parameters, then BN turns into a linear transform on each output channel. Channel-wise BN parameters potentially help channel-wise pruning. BN could also raise the performance of cluster-based quantization techniques by reducing parameter dependency [48].

Since the parameters of the BN operation are not modified in the inference phase, they may be combined with the trained weights and biases. This is called BN folding or BN merging. Equation 3 shows an example of BN folding. The new weight 𝐖′ and bias 𝐛′ are calculated using the pretrained weights 𝐖 and the BN parameters from Equation 2. Since the new weight is computed after training and prior to inference, the number of multiplies is reduced and therefore BN folding decreases inference latency and computational complexity.

\mathbf{W}' = \gamma \cdot \frac{\mathbf{W}}{\sqrt{\sigma^2 + \epsilon}}, \qquad \mathbf{b}' = \gamma \cdot \frac{\mathbf{b} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta    (3)
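A minimal sketch of Equations 2 and 3 is given below, assuming per-output-channel BN statistics and a weight layout whose last axis indexes output channels; the names and layout are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, mu, sigma2, eps=1e-6):
    """Equation 2: per-channel normalization applied at inference time."""
    return gamma * (x - mu) / np.sqrt(sigma2 + eps) + beta

def fold_batchnorm(W, b, gamma, beta, mu, sigma2, eps=1e-6):
    """Equation 3: fold BN parameters into conv weights/bias before inference.

    W: (kh, kw, cin, cout) pretrained weights, b: (cout,) pretrained bias;
    gamma, beta, mu, sigma2 are per-output-channel BN parameters.
    """
    scale = gamma / np.sqrt(sigma2 + eps)   # one multiplicative factor per output channel
    W_folded = W * scale                    # broadcast over the cout axis
    b_folded = scale * (b - mu) + beta
    return W_folded, b_folded
```

After folding, the BN layer can be dropped and the convolution uses W' and b' directly, which is how inference frameworks typically realize BN merging.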
2.6. Pooling

Pooling was first published in the 1980s with the neocognitron [71]. The technique takes a group of values and reduces them to a single value. The single replacement value can be computed as an average of the values (average pooling) or by simply selecting the maximum value (max pooling).

Pooling destroys spatial information as it is a form of down-sampling. The window size defines the area of values to be pooled. For image processing it is usually a square window with typical sizes being 2 × 2, 3 × 3 or 4 × 4. Small windows allow enough information to be propagated to successive layers while reducing the total number of computations [224].

Global pooling is a technique where, instead of reducing a neighborhood of values, an entire feature map is reduced to a single value [154]. Global Average Pooling (GAP) extracts information from multi-channel features and can be used with dynamic pruning [153, 42].

Capsule structures have been proposed as an alternative to pooling. Capsule networks replace the scalar neuron with vectors. The vectors represent a specific entity with more detailed information, such as the position and size of an object. Capsule networks avoid loss of spatial information by capturing it in the vector representation. Rather than reducing a neighborhood of values to a single value, capsule networks perform a dynamic routing algorithm to remove connections [209].

2.7. Parameters

Figure 7: Popular CNN Models: Top-1 accuracy vs GFLOPs and model size, adopted from [23].

Figure 7 shows top-1 accuracy percent versus the number of operations needed for a number of popular neural networks [23]. The number of parameters in each network is represented by the size of the circle. A trend (not shown in the figure) is a yearly increase in parameter complexity. In 2012, AlexNet [133] was published with 60 million parameters. In 2013, VGG [217] was introduced with 133 million parameters and achieved 71.1% top-1 accuracy. These were part of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [207]. The competition's metric was top-1 absolute accuracy. Execution time was not a factor. This incentivized neural network designs with significant redundancy. As of 2020, models with more than 175 billion parameters have been published [26].

Networks that execute in data centers can accommodate models with a large number of parameters. In resource constrained environments such as edge and mobile deployments, reduced parameter models have been designed. For example, GoogLeNet [226] achieves a similar top-1 accuracy of 69.78% as VGG-16 but with only 7 million parameters. MobileNet [105] has 70% top-1 accuracy with only 4.2 million parameters and only 1.14 Giga FLoating-point OPerations (GFLOPs). A more detailed network comparison can be found in [5].

3. Pruning

Network pruning is an important technique for both memory size and bandwidth reduction. In the early 1990s, pruning techniques were developed to reduce a trained large network into a smaller network without requiring retraining [201]. This allowed neural networks to be deployed in constrained environments such as embedded systems. Pruning removes redundant parameters or neurons that do not significantly contribute to the accuracy of results. This condition may arise when the weight coefficients are zero, close to zero, or are replicated. Pruning consequently reduces the computational complexity. If pruned networks are retrained, it provides the possibility of escaping a previous local minimum [43] and further improving accuracy.

Research on network pruning can roughly be categorized as sensitivity calculation and penalty-term methods [201]. Significant recent research interest has continued showing improvements for both network pruning categories or a further combination of them.


Figure 8: Pruning Categories: Static pruning is performed offline prior to inference while dynamic pruning is performed at runtime. (Static pruning: network model, target locating, pruning, then training/tuning. Dynamic pruning: network model with a pruning decision strategy and runtime pruning decision components.)

Recently, new network pruning techniques have been created. Modern pruning techniques may be classified by various aspects including: 1) structured and unstructured pruning, depending on whether the pruned network is symmetric or not, 2) neuron and connection pruning, depending on the pruned element type, or 3) static and dynamic pruning. Figure 8 shows the processing differences between static and dynamic pruning. Static pruning has all pruning steps performed offline prior to inference, while dynamic pruning is performed during runtime. While there is overlap between the categories, in this paper we will use static pruning and dynamic pruning for classification of network pruning techniques.

Figure 9 shows the granularity of pruning opportunities. The four rectangles on the right side correspond to the four brown filters in the top of Figure 2. Pruning can occur on an element-by-element, row-by-row, column-by-column, filter-by-filter, or layer-by-layer basis. Typically element-by-element has the smallest sparsity impact, and results in an unstructured model. Sparsity decreases from left-to-right in Figure 9.

Figure 9: Pruning Opportunities: Different network sparsity results from the granularity of pruned structures (element-wise, channel-wise, shape-wise, filter-wise, layer-wise). Shape-wise pruning was proposed by Wen [241].

\arg\min_{p} L = N(x; \mathbf{W}) - N_p(x; \mathbf{W}_p), \quad \text{where } N_p(x; \mathbf{W}_p) = P(N(x; \mathbf{W}))    (4)

Independent of categorization, pruning can be described mathematically as Equation 4. N represents the entire neural network, which contains a series of layers (e.g., convolutional layer, pooling layer, etc.) with x as input. L represents the performance loss of the pruned network N_p compared to the unpruned network. Network performance is typically defined as accuracy in classification. The pruning function, P(⋅), results in a different network configuration N_p along with the pruned weights 𝐖_p. The following sections are primarily concerned with the influence of P(⋅) on N_p. We also consider how to obtain 𝐖_p.

3.1. Static Pruning

Static pruning is a network optimization technique that removes neurons offline from the network after training and before inference. During inference, no additional pruning of the network is performed. Static pruning commonly has three parts: 1) selection of parameters to prune, 2) the method of pruning the neurons, and 3) optionally fine-tuning or re-training [92]. Retraining may improve the performance of the pruned network to achieve comparable accuracy to the unpruned network, but it may require significant offline computation time and energy.

3.1.1. Pruning Criteria

As a result of network redundancy, neurons or connections can often be removed without significant loss of accuracy. As shown in Equation 1, the core operation of a network is a convolution operation. It involves three parts: 1) input features as produced by the previous layer, 2) weights produced from the training phase, and 3) bias values produced from the training phase. The output of the convolution operation may result in either zero-valued weights or features that lead to a zero output. Another possibility is that similar weights or features may be produced. These may be merged for distributive convolutions.

An early method to prune networks is brute-force pruning. In this method the entire network is traversed element-wise and weights that do not affect accuracy are removed. A disadvantage of this approach is the large solution space to traverse.
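The brute-force search can be sketched as below; evaluate_accuracy is a hypothetical callback that scores the current weights on a validation set, and the nested loop makes the cost of the large solution space explicit. This is only a conceptual sketch, not a practical procedure.

```python
import numpy as np

def brute_force_prune(weights, evaluate_accuracy, tolerance=0.001):
    """Traverse every weight, zero it, and keep the zero only if accuracy survives.

    weights: dict of layer name -> ndarray (modified in place)
    evaluate_accuracy: callable returning validation accuracy for the current weights
    """
    baseline = evaluate_accuracy(weights)
    for name, tensor in weights.items():
        for idx in np.ndindex(tensor.shape):
            saved = tensor[idx]
            if saved == 0.0:
                continue
            tensor[idx] = 0.0                                    # tentatively prune this element
            if baseline - evaluate_accuracy(weights) > tolerance:
                tensor[idx] = saved                              # too costly: restore the weight
    return weights
```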


A typical metric to determine which values to prune is given by the l_p-norm, s.t. p ∈ {ℕ, ∞}, where ℕ is a natural number. The l_p-norm of a vector 𝐱 which consists of n elements is mathematically described by Equation 5.

\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^{p} \right)^{1/p}    (5)

Among the widely applied measurements, the l1-norm is also known as the Manhattan norm and the l2-norm is also known as the Euclidean norm. The corresponding l1 and l2 regularizations have the names LASSO (least absolute shrinkage and selection operator) and Ridge, respectively [230]. The difference between the l2-norm pruned tensor and an unpruned tensor is called the l2-distance. Sometimes researchers also use the term l0-norm, defined as the total number of nonzero elements in a vector.

\arg\min_{\alpha,\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j \mathbf{x}_{ij} \Big)^{2} \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq t    (6)

Equation 6 mathematically describes l1 LASSO regularization. Consider a sample consisting of N cases, each of which consists of p covariates and a single outcome y_i. Let x_i = (x_{i1}, ..., x_{ip})^T be the standardized covariate vector for the i-th case (input feature in DNNs), so we have ∑_i x_{ij}/N = 0 and ∑_i x_{ij}^2/N = 1. 𝛽 = (𝛽_1, ..., 𝛽_p)^T represents the coefficients (weights) and t is a predefined tuning parameter that determines the sparsity. The LASSO estimate 𝛼 is 0 when the average of y_i is 0 because for all t, the solution for 𝛼 is 𝛼 = ȳ. If the constraint is ∑_{j}^{p} 𝛽_j^2 ⩽ t then Equation 6 becomes Ridge regression. Removing the constraint results in the Ordinary Least Squares (OLS) solution.

\arg\min_{\beta \in \mathbb{R}^{p}} \left\{ \frac{1}{N} \|y - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1 \right\}    (7)

Equation 6 can be simplified into the so-called Lagrangian form shown in Equation 7. The Lagrangian multiplier translates the objective function f(x) and constraint g(x) = 0 into the format of ℒ(x, 𝜆) = f(x) − 𝜆g(x), where ‖⋅‖_p is the standard l_p-norm, 𝐗 is the covariate matrix that contains x_{ij}, and 𝜆 is the data-dependent parameter related to t from Equation 6.

Both magnitude-based pruning and penalty-based pruning may generate zero values or near-zero values for the weights. In this section we discuss both methods and their impact.

Magnitude-based pruning: It has been proposed and is widely accepted that trained weights with large values are more important than trained weights with smaller values [143]. This observation is the key to magnitude-based methods. Magnitude-based pruning methods seek to identify unneeded weights or features to remove them from runtime evaluation. Unneeded values may be pruned either in the kernel or at the activation map. The most intuitive magnitude-based pruning method is to prune all zero-valued weights or all weights within an absolute value threshold.

LeCun as far back as 1990 proposed Optimal Brain Damage (OBD) to prune single non-essential weights [140]. By using the second derivative (Hessian matrix) of the loss function, this static pruning technique reduced network parameters by a quarter. For a simplified derivative computation, OBD functions under three assumptions: 1) quadratic - the cost function is near-quadratic, 2) extremal - the pruning is done after the network has converged, and 3) diagonal - it sums up the error of individual weights rather than the error caused by their co-consequence. This research also suggested that the sparsity of DNNs could provide opportunities to accelerate network performance. Later, Optimal Brain Surgeon (OBS) [97] extended OBD with a similar second-order method but removed the diagonal assumption in OBD. OBS considers that the Hessian matrix is usually non-diagonal for most applications. OBS improved the neuron removal precision with up to a 90% reduction in weights for XOR networks.

These early methods reduced the number of connections based on the second derivative of the loss function. The training procedure did not consider future pruning but still resulted in networks that were amenable to pruning. They also suggested that methods based on Hessian pruning would exhibit higher accuracy than those pruned with only magnitude-based algorithms [97]. More recent DNNs exhibit larger weight values when compared to early DNNs. Early DNNs were also much shallower with orders of magnitude fewer neurons. GPT-3 [26], for example, contains 175-billion parameters while VGG-16 [217] contains just 133-million parameters. Calculating the Hessian matrix during training for networks with the complexity of GPT-3 is not currently feasible as it has the complexity of O(W^2). Because of this, simpler magnitude-based algorithms have been developed [177, 141].

Filter-wise pruning [147] uses the l1-norm to remove filters that do not affect the accuracy of the classification. Pruning entire filters and their related feature maps resulted in a reduced inference cost of 34% for VGG-16 and 38% for ResNet-110 on the CIFAR-10 dataset with improved accuracy of 0.75% and 0.02%, respectively.
needed weights or features to remove them from runtime eval-
uation. Unneeded values may be pruned either in the kernel

Most network pruning methods choose to measure weights rather than activations when rating the effectiveness of pruning [88]. However, activations may also be an indicator to prune corresponding weights. Average Percentage Of Zeros (APoZ) [106] was introduced to judge whether one output activation map is contributing to the result. Certain activation functions, particularly rectification such as the Rectified Linear Unit (ReLU), may result in a high percentage of zeros in activations and thus be amenable to pruning. Equation 8 shows the definition of APoZ^{(i)}_c of the c-th neuron in the i-th layer, where 𝐎^{(i)}_c denotes the activation, N is the number of calibration (validation) images, M is the dimension of the activation map, f(true) = 1, and f(false) = 0.

\mathrm{APoZ}^{(i)}_{c} = \mathrm{APoZ}\big(\mathbf{O}^{(i)}_{c}\big) = \frac{\sum_{k=0}^{N} \sum_{j=0}^{M} f\big(\mathbf{O}^{(i)}_{c,j}(k) = 0\big)}{N \times M}    (8)

Similarly, inbound pruning [195], also an activation technique, considers channels that do not contribute to the result. If the top activation channels in the standard convolution of Figure 2 are determined to be less contributing, the corresponding channels of the filters in the bottom of the figure will be removed. After pruning, this technique achieved about 1.5× compression.

Filter-wise pruning using a threshold on the sum of the filters' absolute values can directly take advantage of the structure in the network. In this way, the ratio of pruned to unpruned neurons (i.e. the pruning ratio) is positively correlated with the percentage of kernel weights with zero values, which can be further improved by penalty-based methods.

Penalty-based pruning: In penalty-based pruning, the goal is to modify an error function or add other constraints, known as bias terms, in the training process. A penalty value is used to update some weights to zero or near-zero values. These values are then pruned.

Hanson [96] explored hyperbolic and exponential bias terms for pruning in the late 80s. This method uses weight decay in backpropagation to determine if a neuron should be pruned. Low-valued weights are replaced by zeros. Residual zero-valued weights after training are then used to prune unneeded neurons.

Feature selection [55] is a technique that selects a subset of relevant features that contribute to the result. It is also known as attribute selection or variable selection. Feature selection helps algorithms avoid over-fitting and accelerates both training and inference by removing features and/or connections that don't contribute to the results. Feature selection also aids model understanding by simplifying models to the most important features. Pruning in DNNs can be considered to be a kind of feature selection [123].

LASSO was previously introduced as a penalty term. LASSO shrinks the weights corresponding to the least absolute valued features. This increases weight sparsity. This operation is also referred to as LASSO feature selection and has been shown to perform better than traditional procedures such as OLS by selecting the most significantly contributing variables instead of using all the variables. This led to approximately 60% more sparsity than OLS [181].

Element-wise pruning may result in an unstructured network organization. This leads to sparse weight matrices that are not efficiently executed on instruction set processors. In addition, they are usually hard to compress or accelerate without specialized hardware support [91]. Group LASSO [260] mitigates these inefficiencies by using a structured pruning method that removes entire groups of neurons while maintaining structure in the network organization [17].

Group LASSO is designed to ensure that all the variables sorted into one group are either included or excluded as a whole. Equation 9 gives the pruning constraint, where 𝐗 and 𝛽 in Equation 7 are replaced by the higher dimensional 𝐗_j and 𝛽_j for each of the J groups.

\arg\min_{\beta \in \mathbb{R}^{p}} \left\{ \Big\| y - \sum_{j=1}^{J} \mathbf{X}_j \beta_j \Big\|_2^2 + \lambda \sum_{j=1}^{J} \|\beta_j\|_{K_j} \right\}    (9)

Figure 10: Types of Sparsity Geometry, adopted from [241]: filter-wise W^{(l)}_{n_l,:,:,:}, channel-wise W^{(l)}_{:,c_l,:,:}, shape-wise W^{(l)}_{:,c_l,m_l,k_l}, and depth-wise (layer-wise) W^{(l)} pruning, with shortcut connections shown for residual structures.

Figure 10 shows Group LASSO with group shapes used in Structured Sparsity Learning (SSL) [241]. Weights are split into multiple groups. Unneeded groups of weights are removed using LASSO feature selection. Groups may be determined based on geometry, computational complexity, group sparsity, etc. SSL describes an example where group sparsity in row and column directions may be used to reduce the execution time of GEMM. SSL has shown improved inference times on AlexNet with both CPUs and GPUs by 5.1× and 3.1×, respectively.

Group-wise brain damage [136] also introduced the group LASSO constraint but applied it to filters. This simulates brain damage and introduces sparsity. It achieved 2× speedup with 0.7% ILSVRC-2012 accuracy loss on the VGG network.

Sparse Convolutional Neural Networks (SCNN) [17] take advantage of two-stage tensor decomposition. By decomposing the input feature map and convolutional kernels, the tensors are transformed into two tensor multiplications. Group LASSO is then applied. SCNN also proposed a hardware friendly algorithm to further accelerate sparse matrix computations. They achieved 2.47× to 6.88× speed-up on various types of convolution.

Network slimming [158] applies LASSO on the scaling factors of BN. BN normalizes the activation by statistical parameters which are obtained during the training phase. Network slimming has the effect of introducing forward-invisible additional parameters without additional overhead. Specifically, by setting the BN scaler parameter to zero, channel-wise pruning is enabled. They achieved 82.5% size reduction with VGG and 30.4% computation compression without loss of accuracy on ILSVRC-2012.

Sparse structure selection [111] is a generalized network slimming method. It prunes by applying LASSO to sparse scaling factors in neurons, groups, or residual blocks. Using an improved gradient method, Accelerated Proximal Gradient (APG), the proposed method shows better performance without fine-tuning, achieving 4× speed-up on VGG-16 with 3.93% ILSVRC-2012 top-1 accuracy loss.
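The group-sparsity term of Equation 9 can be added to a training objective roughly as follows, here with one group per filter; the grouping and the weighting λ are illustrative choices rather than the exact SSL [241] formulation.

```python
import numpy as np

def group_lasso_penalty(weights, lam=1e-4):
    """Sum of l2-norms over groups: shrinking a whole group toward zero lets the
    corresponding filter (and its output channel) be removed after training.

    weights: dict of layer name -> (n_filters, kh, kw, cin) arrays.
    """
    penalty = 0.0
    for W in weights.values():
        groups = W.reshape(W.shape[0], -1)                 # one group per filter
        penalty += np.sqrt((groups ** 2).sum(axis=1)).sum()
    return lam * penalty

def regularized_loss(data_loss, weights, lam=1e-4):
    # total objective = task loss + group sparsity term (cf. Equation 9)
    return data_loss + group_lasso_penalty(weights, lam)
```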


Dropout: While not specifically a technique to prune networks, dropout does reduce the number of parameters [222]. It was originally designed as a stochastic regularizer to avoid over-fitting of data [103]. The technique randomly omits a percentage of neurons, typically up to 50%. This dropout operation breaks off part of the connections between neurons to avoid co-adaptations. Dropout could also be regarded as an operation that separately trains many sub-networks and takes the average of them during the inference phase. Dropout increases training overhead but it does not affect the inference time.

Sparse variational dropout [176] added a dropout hyperparameter called the dropout rate to reduce the weights of VGG-like networks by 68×. During training the dropout rate can be used to identify single weights to prune. This can also be applied with other compression approaches for further reduction in weights.

Redundancies: The goal of norm-based pruning algorithms is to remove zeros. This implies that the distribution of values should be wide enough to retain some values but contain enough values close to zero such that a smaller network organization is still accurate. This does not hold in some circumstances. For example, filters that have small norm deviations or a large minimum norm have small search spaces, making it difficult to prune based on a threshold [100]. Even when parameter values are wide enough, in some networks smaller values may still play an important role in producing results. One example of this is when large valued parameters saturate [64]. In these cases magnitude-based pruning of zero values may decrease result accuracy.

Similarly, penalty-based pruning may cause network accuracy loss. In this case, the filters identified as unneeded due to similar coefficient values in other filters may actually be required. Removing them may significantly decrease network accuracy [88]. Section 3.1.2 describes techniques to undo pruning by tuning the weights to minimize network loss, while this section describes redundancy-based pruning.

Using BN parameters, feature map channel distances can be computed by layer [266]. Using a clustering approach for distance, nearby features can be tuned. An advantage of clustering is that redundancy is not measured with an absolute distance but a relative value. With about 60 epochs of training they were able to prune the network, resulting in a 50% reduction in FLOPs (including non-convolutional operations) with a reduction in accuracy of only 1% for both top-1 and top-5 on the ImageNet dataset.

Filter pruning via geometric median (FPGM) [100] identifies filters to prune by measuring the l2-distance using the geometric median. FPGM found a 42% FLOPs reduction with a 0.05% top-1 accuracy drop on ILSVRC-2012 with ResNet-101.

The reduce and reuse (also described as outbound) method [195] prunes entire filters by computing the statistical variance of each filter's output using a calibration set. Filters with low variance are pruned. The outbound method obtained 2.37× acceleration with 1.52% accuracy loss on the Labeled Faces in the Wild (LFW) dataset [110] in the field of face recognition.

A method that iteratively removes redundant neurons for FCLs without requiring special validation data is proposed in [221]. This approach measures the similarity of weight groups after a normalization. It removes redundant weights and merges the weights into a single value. This led to a 34.89% reduction of FCL weights on AlexNet with 2.24% top-1 accuracy loss on ILSVRC-2012.

Compared with the similarity-based approach above, DIVersity NETworks (DIVNET) [167] considers the calculation redundancy based on the activations. DIVNET introduces the Determinantal Point Process (DPP) [166] as a pruning tool. DPP sorts neurons into categories including dropped and retained. Instead of forcing the removal of elements with low contribution factors, they fuse the neurons by a process named re-weighting. Re-weighting works by minimizing the impact of neuron removal. This minimizes pruning influence and mitigates network information loss. They found a 3% loss on the CIFAR-10 dataset when compressing the network to half its weights.

ThiNet [164] adopts statistics information from the next layer to determine the importance of filters. It uses a greedy search to prune the channel that has the smallest reconstruction cost in the next layer. ThiNet prunes layer-by-layer instead of globally to minimize large errors in classification accuracy. It also prunes less during each training epoch to allow for coefficient stability. The pruning ratio is a predefined hyper-parameter and the runtime complexity is directly related to the pruning ratio. ThiNet compressed ResNet-50 FLOPs to 44.17% with a top-1 accuracy reduction of 1.87%.

He [101] adopts LASSO regression instead of a greedy algorithm to estimate the channels. Specifically, in one iteration, the first step is to evaluate the most important channel using the l1-norm. The next step is to prune the corresponding channel that has the smallest Mean Square Error (MSE). Compared to an unpruned network, this approach obtained 2× acceleration of ResNet-50 on ILSVRC-2012 with about 1.4% accuracy loss on top-5, and a 4× reduction in execution time with a top-5 accuracy loss of 1.0% for VGG-16. The authors categorize their approach as dynamic inference-time channel pruning. However, it requires 5000 images for calibration with 10 samples per image and, more importantly, results in a statically pruned network. Thus we have placed it under static pruning.
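A simplified sketch of the geometric-median idea behind FPGM [100] is shown below: filters whose summed l2-distance to the other filters in the layer is smallest are the most replaceable and become pruning candidates. The actual optimization in the paper differs; this only shows the ranking, with an illustrative weight layout.

```python
import numpy as np

def fpgm_prune_candidates(W, n_prune):
    """Return indices of the n_prune filters closest to the rest of the layer.

    W: (n_filters, kh, kw, cin) weight tensor.
    """
    flat = W.reshape(W.shape[0], -1)
    # pairwise l2-distances between all filters
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    total_dist = dists.sum(axis=1)   # small total distance -> near the geometric median
    return np.argsort(total_dist)[:n_prune]

W = np.random.randn(64, 3, 3, 32)
print(fpgm_prune_candidates(W, n_prune=16))
```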


3.1.2. Pruning combined with Tuning or Retraining

Pruning removes network redundancies and has the benefit of reducing the number of computations without significant impact on accuracy for some network architectures. However, as the estimation criterion is not always accurate, some important elements may be eliminated, resulting in a decrease in accuracy. Because of the loss of accuracy, time-consuming fine-tuning or re-training may be employed to increase accuracy [258].

Deep compression [92], for example, describes a static method to prune connections that don't contribute to classification accuracy. In addition to feature map pruning they also remove weights with small values. After pruning they re-train the network to improve accuracy. This process is performed iteratively three times, resulting in a 9× to 13× reduction in total parameters with no loss of accuracy. Most of the removed parameters were from FCLs.

Recoverable Pruning: Pruned elements usually cannot be recovered. This may result in reduced network capability. Recovering lost network capability requires significant retraining. Deep compression required millions of iterations to retrain the network [92]. To avoid this shortcoming, many approaches adopt recoverable pruning algorithms. The pruned elements may also be involved in the subsequent training process and adjust themselves to fit the pruned network.

Guo [88] describes a recoverable pruning method using binary mask matrices to indicate whether a single weight value is pruned or not. The l1-norm pruned weights can be stochastically spliced back into the network. Using this approach AlexNet was able to be reduced by a factor of 17.7× with no accuracy loss. Re-training iterations were significantly reduced to 14.58% of Deep compression [92]. However, this type of pruning still results in an asymmetric network, complicating hardware implementation.

Soft Filter Pruning (SFP) [99] further extended recoverable pruning using the dimension of the filter. SFP obtained structured compression results with the additional benefit of reduced inference time. Furthermore, SFP can be used on difficult-to-compress networks, achieving a 29.8% speed-up on ResNet-50 with 1.54% ILSVRC-2012 top-1 accuracy loss. Compared with Guo's recoverable weight [88] technique, SFP achieves inference speed-ups closer to theoretical results on general purpose hardware by taking advantage of the structure of the filter.

Increasing Sparsity: Another motivation to apply fine-tuning is to increase network sparsity. Sparse constraints [270] applied low rank tensor constraints [157] and group sparsity [57], achieving a 70% reduction of neurons with a 0.57% drop of AlexNet in ILSVRC-2012 top-1 accuracy.

Adaptive Sparsity: No matter what kind of pruning criteria is applied, a layer-wise pruning ratio usually requires a human decision. Too high a ratio resulting in very high sparsity may cause the network to diverge, requiring heavy re-tuning.

Network slimming [158], previously discussed, addresses this problem by automatically computing layer-wise sparsity. This achieved a 20× model size compression, 5× computing reduction, and less than 0.1% accuracy loss on the VGG network.

Pruning can also be performed using a min-max optimization module [218] that maintains network accuracy during tuning by keeping a pruning ratio. This technique compressed the VGG network by a factor of 17.5× and resulted in a theoretical execution time (FLOPs) of 15.56% of the unpruned network. A similar approach was proposed with an estimation of weight sets [33]. By avoiding the use of a greedy search to keep the best pruning ratio, they achieved the same ResNet classification accuracy with only 5% to 10% of the size of the original weights.

AutoPruner [163] integrated the pruning and fine-tuning of a three-stage pipeline as an independent training-friendly layer. The layer helped gradually prune during training, eventually resulting in a less complex network. AutoPruner pruned 73.59% of compute operations on VGG-16 with 2.39% ILSVRC-2012 top-1 loss. ResNet-50 resulted in 65.80% of compute operations with 3.10% loss of accuracy.

Training from Scratch: Observation shows that network training efficiency and accuracy are inversely proportional to structure sparsity. The more dense the network, the less training time [94, 147, 70]. This is one reason that current pruning techniques tend to follow a train-prune-tune pipeline rather than training a pruned structure from scratch.

However, the lottery ticket hypothesis [70] shows that it is not of primary importance to preserve the original weights but the initialization. Experiments show that dense, randomly-initialized pruned sub-networks can be trained effectively and reach comparable accuracy to the original network with the same number of training iterations. Furthermore, standard pruning techniques can uncover the aforementioned sub-networks from a large oversized network - the Winning Tickets. In contrast with current static pruning techniques, the lottery ticket hypothesis after a period of time drops all well-trained weights and resets them to an initial random state. This technique found that ResNet-18 could maintain comparable performance with a pruning ratio of up to 88.2% on the CIFAR-10 dataset.

Towards Better Accuracy: By reducing the number of network parameters, pruning techniques can also help to reduce over-fitting. Dense-Sparse-Dense (DSD) training [93] helps various networks improve classification accuracy by 1.1% to 4.3%. DSD uses a three-stage pipeline: 1) dense training to identify important connections, 2) pruning insignificant weights and sparse training with a sparsity constraint to reduce the number of parameters, and 3) re-densifying the structure to recover the original symmetric structure, which also increases the model capacity. The DSD approach has also shown impressive performance on other types of deep networks such as Recurrent Neural Networks (RNNs) and Long Short Term Memory networks (LSTMs).

Decision Components
Network Info
Decision Data
Network Data
1.Additional connections or side networks?
Side
2.Layer-wise pruning or channel-wise?
Network 2 5
3.One-shot information input or layer-wise?
Pruning
Decision 4.How to calculate the score?
1 4 7
Additional Connections 5.Predefined thresholds or dynamical?
attached to Network A
6.Continue, skip or exit computing?
7.How to train the decision components?
3

6 ... ...
Input
Image(s)
6
Network A Network B Network C

Cascade Network

Exit Exit Exit Exit Exit

Figure 11: Dynamic Pruning System Considerations

namic pruning typically doesn’t perform runtime fine-tuning ending the computation and outputing the predicting
or re-training. In Figure 11, we show an overview of dynamic results [68, 145, 148]. In this case the remaining layers
pruning systems. The most important consideration is the de- are considered to be pruned.
cision system that decides what to prune. The related issues 7. Training the decision component: a) attached con-
are: nections can be trained along with the original net-
1. The type of the decision components: a) additional work [145, 148, 73], b) side networks are typically
connections attached to the original network used dur- trained using reinforcement learning (RL) algorithms
ing the inference phase and/or the training phase, b) [19, 153, 189, 246].
characteristics of the connections that can be learned For instruction set processors, feature maps or the number
by standard backpropagation algorithms [73], and c) a of filters used to identify objects is a large portion of band-
side decision network which tends to perform well but width usage [225] - especially for depth-wise or point-wise
is often difficult to train [153]. convolutions where features consume a larger portion of the
2. The pruning level (shape): a) channel-wise [153, 73, bandwidth [47]. Dynamic tuning may also be applied to stat-
42], b) layer-wise [145], c) block-wise [246], or d) ically pruned networks potentially further reducing compute
network-wise [25]. The pruning level chosen influ- and bandwidth requirements.
ences hardware design. A drawback of dynamic pruning is that the criteria to
3. Input data: a) one-shot information feeding [246] feeds determine which elements to prune must be computed at run-
the entire input to the decision system, and b) layer- time. This adds overhead to the system requiring additional
wise information feeding [25, 68] where a window of compute, bandwidth, and power. A trade-off between dy-
data is iteratively fed to the decision system along with namic pruning overhead, reduced network computation, and
the forwarding. accuracy loss, should be considered. One method to miti-
4. Computing a decision score: 𝑙𝑝 -norm [73], or b) other gate power consumption inhibits computations from 0-valued
approaches [108]. parameters within a Processing Element (PE) [153].
5. Score comparison: a) human experience/experiment
3.2.1. Conditional Computing
results [145] or b) automatic threshold or dynamic
Conditional computing involves activating an optimal
mechanisms [108].
part of a network without activating the entire network. Non-
6. Stopping criteria: a) in the case of layer-wise and
activated neurons are considered to be pruned. They do
network-wise pruning, some pruning algorithms skip
not participate in the result thereby reducing the number of
the pruned layer/network [19, 246], b) some algorithms
computations required. Conditional computing applies to
dynamically choose the data path [189, 259], and c)

T Liang et al.: Preprint submitted to Elsevier Page 12 of 41


Survey on pruning and quantization

training and inference [20, 56]. RL. The MDP reward function in the state-action-reward
Conditional computing has a similarity with RL in that sequence is computation efficiency. Rather than removing
they both learn a pattern to achieve a reward. Bengio [19] layers, a side network of RNP predicts which feature maps are
split the network into several blocks and formulates the block not needed. They found 2.3× to 5.9× reduction in execution
chosen policies as an RL problem. This approach consists time with top-5 accuracy loss from 2.32% to 4.89% for VGG-
of only fully connected neural networks and achieved a 5.3× 16.
speed-up on CIFAR-10 dataset without loss of accuracy.
3.2.3. Differentiable Adaptive Networks
3.2.2. Reinforcement Learning Adaptive Networks Most of the aforementioned decision components are non-
Adaptive networks aim to accelerating network inference differential, thus computationally expensive RL is adopted
by conditionally determining early exits. A trade-off be- for training. A number of techniques have been developed to
tween network accuracy and computation can be applied reduce training complexity by using differentiable methods.
using thresholds. Adaptive networks have multiple interme- Dynamic channel pruning [73] proposes a method to dy-
diate classifiers to provide the ability of an early exit. A namically select which channel to skip or to process using
cascade network is a type of adaptive network. Cascade net- Feature Boosting and Suppression (FBS). FBS is a side net-
works are the combinations of serial networks which all have work that guides channel amplification and omission. FBS is
output layers rather than per-layer outputs. Cascade networks trained along with convolutional networks using SGD with
have a natural advantage of an early exit by not requiring LASSO constraints. The selecting indicator can be merged
all output layers to be computed. If the early accuracy of a into BN parameters. FBS achieved 5× acceleration on VGG-
cascade network is not sufficient, inference could potentially 16 with 0.59% ILSVRC-2012 top-5 accuracy loss, and 2×
be dispatched to a cloud device [145, 25]. A disadvantage of acceleration on ResNet-18 with 2.54% top-1, 1.46% top-5
adaptive networks is that they usually need hyper-parameters accuracy loss.
optimized manually (e.g., confidence score [145]). This intro- Another approach, Dynamic Channel Pruning (DCP)
duces automation challenges as well as classification accuracy [42] dynamically prunes channels using a channel thresh-
loss. They found 28.75% test error on CIFAR-10 when set- old weighting (T-Weighting) decision. Specifically, this mod-
ting the threshold to 0.5. A threshold of 0.99 lowered the ule prunes the channels whose score is lower than a given
error to 15.74% at a cost of 3x to inference time. threshold. The score is calculated by a T-sigmoid activation
A cascading network [189] is an adaptive network with function, which is mathematically described in Equation 10,
an RL trained Composer that can determine a reasonable where 𝜎(𝑥) = 1∕(1 + 𝑒−𝑥 ) is the sigmoid function. The input
computation graph for each input. An adaptive controller to the T-sigmoid activation function is down sampled by a
Policy Preferences is used to intelligently enhance the Com- FCL from the feature maps. The threshold is found using
poser allowing an adjustment of the network computation iterative training which can be a computationally expensive
graph from sub-graphs. The Composer performs much better process. DCP increased VGG-16 top-5 error by 4.77% on
in terms of accuracy than the baseline network with the same ILSVRC-2012 for 5× computation speed-up. By comparison,
number of computation-involved parameters on a modified RNP increased VGG-16 top-5 error by 4.89% [153].
dataset, namely Wide-MNIST. For example, when invoking {
1k parameters, the baseline achieves 72% accuracy while the 𝜎(𝑥), if 𝑥 > 𝑇
ℎ(𝑥) = (10)
Composer obtained 85%. 0, otherwise
BlockDrop [246] introduced a policy network that trained
using RL to make an image-specific determination whether The cascading neural network by Leroux [145] reduced
a residual network block should participate in the follow- the average inference time of overfeat network [211] by 40%
ing computation. While the other approaches compute an with a 2% ILSVRC-2012 top-1 accuracy loss. Their criteria
exit confidence score per layer, the policy network runs only for early exit is based on the confidence score generated by an
once when an image is loaded. It generates a boolean vec- output layer. The auxiliary layers were trained with general
tor that indicates which residual blocks are activate or in- backpropagation. The adjustable score threshold provides a
active. BlockDrop adds more flexibility to the early exit trade-off between accuracy and efficiency.
mechanism by allowing a decision to be made on any block Bolukbasi [25] reports a system that contains a com-
and not just early blocks in Spatially Adaptive Computation bination of other SOTA networks (e.g., AlexNet, ResNet,
Time (SACT) [68]. This is discussed further in Section 3.2.3. GoogLeNet, etc.). A policy adaptively chooses a point to
BlockDrop achieves an average speed-up of 20% on ResNet- exit early. This policy can be trained by minimizing its cost
101 for ILSVRC-2012 without accuracy loss. Experiments function. They format the system as a directed acyclic graph
using the CIFAR dataset showed better performance than with various pre-trained networks as basic components. They
other SOTA counterparts at that time [68, 82, 147]. evaluate this graph to determine leaf nodes for early exit.
Runtime Neural Pruning (RNP) [153] is a framework The cascade of acyclic graphs with a combination of various
that prunes neural networks dynamically. RNP formulates networks reduces computations while maintaining predic-
the feature selection problem as a Markov Decision Process tion accuracy. ILSVRC-2012 experiments show ResNet-50
(MDP) and then trains an RNN-based decision network by acceleration of 2.8× with 1% top-5 accuracy loss and 1.9×
speed-up with no accuracy loss.

T Liang et al.: Preprint submitted to Elsevier Page 13 of 41


Survey on pruning and quantization

Considering the similarity of RNNs and residual networks This implies the pruned architecture itself is crucial to suc-
[83], Spatially Adaptive Computation Time (SACT) [68] cess. By this observation, the pruning algorithms could be
explored an early stop mechanism of residual networks in seen as a type of NAS. Liu concluded that because the weight
the spatial domain. SACT can be applied to various tasks values can be re-trained, by themselves they are not effica-
including image classification, object detection, and image cious. However, the lottery ticket hypothesis [70] achieved
segmentation. SACT achieved about 20% acceleration with comparable accuracy only when the weight initialization
no accuracy loss for ResNet-101 on ILSVRC-2012. was exactly the same as the unpruned model. Glae [72]
To meet the computation constraints, Multi-Scale Dense resolved the discrepancy by showing that what really matters
Networks (MSDNets) [108] designed an adaptive network is the pruning form. Specifically, unstructured pruning can
using two techniques: 1) an anytime-prediction to generate only be fine-tuned to restore accuracy but structured pruning
prediction results at many nodes to facilitate the network’s can be trained from scratch. In addition, they explored the
early exit and 2) batch computational budget to enforce a performance of dropout and 𝑙0 regularization. The results
simpler exit criteria such as a computation limit. MSDNets showed that simple magnitude based pruning can perform
combine multi-scale feature maps [265] and dense connec- better. They developed a magnitude based pruning algorithm
tivity [109] to enable accurate early exit while maintaining and showed the pruned ResNet-50 obtained higher accuracy
higher accuracy. The classifiers are differentiable so that than SOTA at the same computational complexity.
MSDNets can be trained using stochastic gradient descent.
MSDNets achieve 2.2× speed-up at the same accuracy for
ResNet-50 on ILSVRC-2012 dataset. 4. Quantization
To address the training complexity of adaptive networks, Quantization is known as the process of approximating
Li [148] proposed two methods. The first method is gradient a continuous signal by a set of discrete symbols or integer
equilibrium (GE). This technique helps backbone networks values. Clustering and parameter sharing also fall within
converge by using multiple intermediate classifiers across this definition [92]. Partial quantization uses clustering al-
multiple different network layers. This improves the gradi- gorithms such as k-means to quantize weight states and then
ent imbalance issue found in MSDNets [108]. The second store the parameters in a compressed file. The weights can be
method is an Inline Subnetwork Collaboration (ISC) and a decompressed using either a lookup table or a linear transfor-
One-For-All knowledge distillation (OFA). Instead of inde- mation. This is typically performed during runtime inference.
pendently training different exits, ISC takes early predictions This scheme only reduces the storage cost of a model. This
into later predictors to enhance their input information. OFA is discussed in Section 4.2.4. In this section we focus on
supervises all the intermediate exits using a final classifier. At numerical low-bit quantization.
a same ILSVRC-2012 top-1 accuracy of 73.1%, their network Compressing CNNs by reducing precision values has
takes only one-third the computational budget of ResNet. been previously proposed. Converting floating-point parame-
Slimmable Neural Networks (SNN) [259] are a type of ters into low numerical precision datatypes for quantizing neu-
networks that can be executed at different widths. Also known ral networks was proposed as far back as the 1990s [67, 14].
as switchable networks, the network enables dynamically Renewed interest in quantization began in the 2010s when 8-
selecting network architectures (width) without much compu- bit weight values were shown to accelerate inference without
tation overhead. Switchable networks are designed to adap- a significant drop in accuracy [233].
tively and efficiently make trade-offs between accuracy and Historically most networks are trained using FP32 num-
on-device inference latency across different hardware plat- bers [225]. For many networks an FP32 representation has
forms. SNN found that the difference of feature mean and greater precision than needed. Converting FP32 parameters
variance may lead to training faults. SNN solves this issue to lower bit representations can significantly reduce band-
with a novel switchable BN technique and then trains a wide width, energy, and on-chip area.
enough network. Unlike cascade networks which primar- Figure 12 shows the evolution of quantization techniques.
ily benefit from specific blocks, SNN can be applied with Initially, only weights were quantized. By quantizing, cluster-
many more types of operations. As BN already has two pa- ing, and sharing, weight storage requirements can be reduced
rameters as mentioned in Section 2, the network switch that by nearly 4×. Han [92] combined these techniques to reduce
controls the network width comes with little additional cost. weight storage requirements from 27MB to 6.9MB. Post train-
SNN increased top-1 error by 1.4% on ILSVRC-2012 while ing quantization involves taking a trained model, quantizing
achieving about 2× speed-up. the weights, and then re-optimizing the model to generate a
quantized model with scales [16]. Quantization-aware train-
3.3. Comparisons ing involves fine-tuning a stable full precision model or re-
Pruning techniques are diverse and difficult to compare. training the quantized model. During this process real-valued
Shrinkbench [24] is a unified benchmark framework aiming weights are often down-scaled to integer values - typically
to provide pruning performance comparisons. 8-bits [120]. Saturated quantization can be used to generate
There exist ambiguities about the value of the pre-trained feature scales using a calibratation algorithm with a calibra-
weights. Liu [160] argues that the pruned model could be tion set. Quantized activations show similar distributions
trained from scratch using a random weight initialization. with previous real-valued data [173]. Kullback-Leibler di-

T Liang et al.: Preprint submitted to Elsevier Page 14 of 41


Survey on pruning and quantization

cluster/ post train quantize-aware quantize-aware


sharing quantize training training
weights
activation
calibrated
floating point floating point non-saturated
saturated

Figure 12: Quantization Evolution: The development of quantization techniques, from left to right. Purple rectangles indicated
quantized data while blue rectangles represent full precision 32-bit floating point format.

vergence (KL-divergence, also known as relative entropy or There are many methods to quantize a given network. Gener-
information divergence) calibrated quantization is typically ally, they are formulated as Equation 12 where 𝑠 is a scalar
applied and can accelerate the network without accuracy loss that can be calculated using various methods. 𝑔(⋅) is the
for many well known models [173]. Fine-tuning can also be clamp function applied to floating-point values 𝐗𝑟 perform-
applied with this approach. ing the quantization. 𝑧 is the zero-point to adjust the true
KL-divergence is a measure to show the relative entropy zero in some asymmetrical quantization approaches. 𝑓 (⋅) is
of probability distributions between two sets. Equation 11 the rounding function. This section introduces quantization
gives the equation for KL-divergence. 𝑃 and 𝑄 are defined using the mathematical framework of Equation 12.
as discrete probability distributions on the same probability
space. Specifically, 𝑃 is the original data (floating-point)
distribution that falls in several bins. 𝑄 is the quantized data 𝑐𝑙𝑎𝑚𝑝(𝑥, 𝛼, 𝛽) = 𝑚𝑎𝑥(𝑚𝑖𝑛(𝑥, 𝛽), 𝛼) (13)
histogram. Equation 13 defines a clamp function. The min-max
( ) method is given by Equation 14 where [𝑚, 𝑀] are the bounds

𝑁
𝑃 (𝑥𝑖 )
𝐷KL (𝑃 ‖𝑄) = 𝑃 (𝑥𝑖 ) log (11) for the minimum and maximum values of the parameters, re-
𝑖=0
𝑄(𝑥𝑖 ) spectively. 𝑛 is the maximum representable number derived
from the bit-width (e.g., 256 = 28 in case of 8-bit), and 𝑧, 𝑠
Depending upon the processor and execution environ- are the same as in Equation 12. 𝑧 is typically non-zero in the
ment, quantized parameters can often accelerate neural net- min-max method [120].
work inference.
Quantization research can be categorized into two focus 𝑔(𝑥) = 𝑐𝑙𝑎𝑚𝑝(𝑥, 𝑚, 𝑀)
areas: 1) quantization aware training (QAT) and 2) post train- 𝑛−1 𝑚 × (1 − 𝑛)
ing quantization (PTQ). The difference depends on whether 𝑠= , 𝑧= (14)
𝑀 −𝑚 𝑀 −𝑚
training progress is is taken into account during training. Al- where 𝑚 = min{𝐗𝑖 }, 𝑀 = max{𝐗𝑖 }
ternatively, we could also categorize quantization by where
data is grouped for quantization: 1) layer-wise and 2) channel- The max-abs method uses a symmetry bound shown in
wise. Further, while evaluating parameter widths, we could Equation 15. The quantization scale 𝑠 is calculated from
further classify by length: N-bit quantization. the largest one 𝑅 among the data to be quantized. Since the
Reduced precision techniques do not always achieve the bound is symmetrical, the zero point 𝑧 will be zero. In such
expected speedup. For example, INT8 inference doesn’t a situation, the overhead of computing an offset-involved
achieve exactly 4× speedup over 32-bit floating point due convolution will be reduced but the dynamic range is reduced
to the additional operations of quantization and dequanti- since the valid range is narrower. This is especially noticeable
zation. For instance, Google’s TensorFlow-Lite [227] and for ReLU activated data where all of which values fall on the
nVidia’s Tensor RT [173] INT8 inference speedup is about positive axis.
2-3×. Batch size is the capability to process more than one
image in the forward pass. Using larger batch sizes, Tensor 𝑔(𝑥) = 𝑐𝑙𝑎𝑚𝑝(𝑥, −𝑀, 𝑀)
RT does achieve 3-4× acceleration with INT8 [173]. 𝑠=
𝑛−1
, 𝑧=0 (15)
Section 8 summarizes current quantization techniques 𝑅
used on the ILSVRC-2012 dataset along with their bit-widths where 𝑅 = max{𝑎𝑏𝑠{𝐗𝑖 }}
for weights and activation.
Quantization can be applied on input features 𝐅, weights
4.1. Quantization Algebra 𝐖, and biases 𝐛. Taking feature 𝐅 and weights 𝐖 as an
example (ignoring the biases) and using the min-max method
gives Equation 16. The subscripts 𝑟 and 𝑞 denote the real-
𝐗𝑞 = 𝑓 (𝑠 × 𝑔(𝐗𝑟 ) + 𝑧) (12) valued and quantized data, respectively. The 𝑚𝑎𝑥 suffix is

T Liang et al.: Preprint submitted to Elsevier Page 15 of 41


Survey on pruning and quantization

from 𝑅 in Equation 15, while 𝑠𝑓 = (𝑛 − 1)∕𝐹𝑚𝑎𝑥 , 𝑠𝑤 = 4.2. Quantization Methodology


(𝑛 − 1)∕𝑊𝑚𝑎𝑥 . We describe PTQ and QAT quantization approaches based
on back-propagation use. We can also categorize them based
𝑛−1 𝑛−1
𝐅𝑞 = × 𝐅𝑟 , 𝐖𝑞 = × 𝐖𝑟 (16) on bit-width. In the following subsections, we introduce com-
𝐹𝑚𝑎𝑥 𝑊𝑚𝑎𝑥 mon quantization methods. In Section 4.2.1 low bit-width
Integer quantized convolution is shown in Equation 17 quantization is discussed. In Section 4.2.2 and Section 4.2.3
and follows the same form as convolution with real values. In special cases of low bit-width quantization is discussed. In
Equation 17, the ∗ denotes the convolution operation, 𝐅 the Section 4.2.5 difficulties with training quantized networks
feature, 𝐖 the weights, and 𝐎𝑞 , the quantized convolution are discussed. Finally, in Section 4.2.4, alternate approached
result. Numerous third party libraries support this type of in- to quantization are discussed.
teger quantized convolution acceleration. They are discussed
in Section 4.3.2. 4.2.1. Lower Numerical Precision
Half precision floating point (16-bit floating-point, FP16)
𝐎𝑞 = 𝐅𝑞 ∗ 𝐖𝑞 s.t. 𝐅, 𝐖 ∈ ℤ (17) has been widely used in nVidia GPUs and ASIC accelerators
with minimal accuracy loss [54]. Mixed precision training
De-quantizing converts the quantized value 𝐎𝑞 back to with weights, activations, and gradients using FP16 while
floating-point 𝐎𝑟 using the feature scales 𝑠𝑓 and weights the accumulated error for updating weights remains in FP32
scales 𝑠𝑤 . A symmetric example with 𝑧 = 0 is shown in Equa- has shown SOTA performance - sometimes even improved
tion 18. This is useful for layers that process floating-point performance [172].
tensors. Quantization libraries are discussed in Section 4.3.2. Researchers [165, 98, 233] have shown that FP32 parame-
ters produced during training can be reduced to 8-bit integers
𝐎𝑞 𝐹𝑚𝑎𝑥 𝑊𝑚𝑎𝑥 for inference without significant loss of accuracy. Jacob [120]
𝐎𝑟 = = 𝐎𝑞 × × (18) applied 8-bit integers for both training and inference, with an
𝑠𝑓 × 𝑠𝑤 (𝑛 − 1) (𝑛 − 1)
accuracy loss of 1.5% on ResNet-50. Xilinx [212] showed
In most circumstances, consecutive layers can compute that 8-bit numerical precision could also achieve lossless per-
with quantized parameters. This allows dequantization to formance with only one batch inference to adjust quantization
be merged in one operation as in Equation 19. 𝐅𝑙+1
𝑞 is the parameters and without retraining.
quantized feature for next layer and 𝑠𝑙+1 is the feature scale Quantization can be considered an exhaustive search op-
𝑓
for next layer. timizing the scale found to reduce an error term. Given a
floating-point network, the quantizer will take an initial scale,
𝐎𝑞 × 𝑠𝑙+1
𝑓 typically calculated by minimizing the 𝑙2 -error, and use it
𝐅𝑙+1
𝑞 = (19) to quantize the first layer weights. Then the quantizer will
𝑠𝑓 × 𝑠𝑤
adjust the scale to find the lowest output error. It performans
The activation function can be placed following either this operation on every layer.
the quantized output 𝐎𝑞 , the de-quantized output 𝐎𝑟 , or after Integer Arithmetic-only Inference (IAI) [120] proposed
a re-quantized output 𝐅𝑙+1
𝑞 . The different locations may lead
a practical quantization scheme able to be adopted by indus-
to different numerical outcomes since they typically have try using standard datatypes. IAI trades off accuracy and
different precision. inference latency by compressing compact networks into in-
Similar to convolutional layers, FCLs can also be quan- tegers. Previous techniques only compressed the weights of
tized. K-means clustering can be used to aid in the compres- redundant networks resulting in better storage efficiency. IAI
sion of weights. In 2014 Gong [76] used k-means clustering quantizes 𝑧 ≠ 0 in Equation 12 requiring additional zero-
on FCLs and achieved a compression ratio of more than 20× point handling but resulting in higher efficiency by making
with 1% top-5 accuracy loss. use of unsigned 8-bit integers. The data-flow is described in
Bias terms in neural networks introduce intercepts in Figure 13. TensorFlow-Lite [120, 131] deployed IAI with
linear equations. They are typically regarded as constants an accuracy loss of 2.1% using ResNet-150 on the ImageNet
that help the network to train and best fit given data. Bias dataset. This is described in more detail in Section 4.3.2.
quantization is not widely mentioned in the literature. [120] Datatypes other than INT8 have been used to quantize
maintained 32-bit biases while quantizing weights to 8-bit. parameters. Fixed point, where the radix point is not at the
Since biases account for minimal memory usage (e.g. 12 right-most binary digit, is one format that has been found to be
values for a 10-in/12-out FCL vs 120 weight values) it is useful. It provides little loss or even higher accuracy but with
recommended to leave biases in full precision. If bias quan- a lower computation budget. Dynamic scaled fixed-point
tization is performed it can be a multiplication by both the representation [233] obtained a 4× acceleration on CPUs.
feature scale and weight scale [120], as shown in Equation 20. However, it requires specialized hardware including 16-bit
However, in some circumstances they may have their own fixed-point [89], 16-bit flex point [130], and 12-bit opera-
scale factor. For example, when the bit-lengths are limited to tions using dynamic fixed-point format (DFXP) [51]. The
be shorter than the multiplication results. specialized hardware is mentioned in Section 4.3.3.

𝑠𝑏 = 𝑠𝑤 × 𝑠𝑓 , 𝐛𝑞 = 𝐛𝑟 × 𝑠𝑏 (20)

T Liang et al.: Preprint submitted to Elsevier Page 16 of 41


Survey on pruning and quantization

biases
tion 21. 𝑖𝑑𝑥𝑖 (𝑛) is the index for the 𝑖𝑡ℎ weights in the 𝑛𝑡ℎ
weights uint32 code-book. Each coded weight 𝑤𝑖 can be indexed by the
uint8
activation
NB-bit expression.
conv + ReLU6
uint8
feature
uint8

𝑁
[ ]
𝑤𝑖 = 𝐂𝑛 idx𝑖 (𝑛)
𝑛=1
Figure 13: Integer Arithmetic-only Inference: The convolution { } (21)
operation takes unsigned int8 weights and inputs, accumulates 𝐂𝑛 = 0, ±2−𝑛+1 , ±2−𝑛 , ±2−𝑛−1 , … , ±2−𝑛−⌊𝑀∕2⌋+2
them to unsigned int32, and then performs a 32-bit addi-
tion with biases. The ReLU6 operation outputs 8-bit integers. where 𝑀 = 2𝐵 − 1
Adopted from [120]
Note that the number of code-books 𝐶𝑛 can be greater than
one. This means the encoded weight might be a combination
of multiple shift operations. This property allows ShiftCNN
4.2.2. Logarithmic Quantization to expand to a relatively large-scale quantization or to shrink
Bit-shift operations are inexpensive to implement in hard- to binarized or ternary weights. We discuss ternary weights in
ware compared to multiplication operations. FPGA imple- Section 4.2.3. ShiftCNN was deployed on an FPGA platform
mentations [6] specifically benefit by converting floating- and achieved comparable accuracy on the ImageNet dataset
point multiplication into bit shifts. Network inference can with 75% power saving and up to 1090× clock cycle speed-up.
be further optimized if weights are also constrained to be ShiftCNN achieves this impressive result without requiring re-
power-of-two with variable-length encoding. Logarithmic training. With 𝑁 = 2 and 𝐵 = 4 encoding, SqueezeNet [115]
quantization takes advantage of this by being able to express has only 1.01% top-1 accuracy loss. The loss for GoogLeNet,
a larger dynamic range compared to linear quantization. ResNet-18, and ResNet-50 is 0.39%, 0.54%, and 0.67%, re-
Inspired by binarized networks [52], introduced in Sec- spectively, While compressing the weights into 7/32 of the
tion 4.2.3, Lin [156] forced the neuron output into a power- original size. This implies that the weights have significant
of-two value. This converts multiplications into bit-shift redundancy.
operations by quantizing the representations at each layer of Based on LogNN, Cai [30] proposed improvements by
the binarized network. Both training and inference time are disabling activation quantization to reduce overhead during
thus reduced by eliminating multiplications. inference. This also reduced the clamp bound hyperparameter
Incremental Network Quantization (INQ) [269] replaces tuning during training. These changes resulted in many low-
weights with power-of-two values. This reduces computa- valued weights that are rounded to the nearest value during
tion time by converting multiplies into shifts. INQ weight encoding. As 2𝑛 s.t. 𝑛 ∈ 𝑁 increases quantized weights
quantization is performed iteratively. In one iteration, weight sparsity as 𝑛 increases. In this research, 𝑛 is allowed to be
pruning-inspired weight partitioning is performed using group- real-valued numbers as 𝑛 ∈ 𝑅 to quantize the weights. This
wise quantization. These weights are then fine-tuned by using makes weight quantization more complex. However, a code-
a pruning-like measurement [92, 88]. Group-wise retraining book helps to reduce the complexity.
fine-tunes a subset of weights in full precision to preserve In 2019, Huawei proposed DeepShift, a method of sav-
ensemble accuracy. The other weights are converted into ing computing power by shift convolution [62]. DeepShift
power-of-two format. After multiple iterations most of the removed all floating-point multiply operations and replaced
full precision weights are converted to power-of-two. The them with bit reverse and bit shift. The quantized weight
final networks have weights from 2 (ternary) to 5 bits with 𝑊𝑞 transformation is shown mathematically in Equation 22,
values near zero set to zero. Results of group-wise iterative where 𝑆 is a sign matrix, 𝑃 is a shift matrix, and 𝑍 is the set
quantization show lower error rates than a random power-of- of integers.
two strategy. Specifically, INQ obtained 71× compression
with 0.52% top-1 accuracy loss on the ILSVRC-2012 with 𝑊𝑞 = 𝑆 × 2𝑃 , s.t. 𝑃 ∈ ℤ, 𝑆 ∈ {−1, 0, +1} (22)
AlexNet.
Logarithmic Neural Networks (LogNN) [175] quantize Results indicate that DeepShift networks cannot be easily
weights and features into a log-based representation. Loga- trained from scratch. They also show that shift-format net-
rithmic backpropagation during training is performed using works do not directly learn for lager datasets such as Im-
shift operations. Bases other than 𝑙𝑜𝑔2 can be used. 𝑙𝑜𝑔√2 agenet. Similar to INQ, they show that fine-tuning a pre-
trained network can improve performance. For example,
based arithmetic is described as a trade-off between dynamic
with the same configuration of 32-bit activations and 6-bit
range and representation precision. 𝑙𝑜𝑔2 showed 7× compres-
shift-format weights, the top-1 ILSVRC-2012 accuracy loss
sion with 6.2% top-5 accuracy loss on AlexNet, while 𝑙𝑜𝑔√2
on ResNet-18 for trained from scratch and tuned from a pre-
showed 1.7% top-5 accuracy loss. trained model are 4.48% and 1.09%, respectively.
Shift convolutional neural networks (ShiftCNN) [84] im- DeepShift proposes models with differential backpropa-
prove efficiency by quantizing and decomposing the real- gation for generating shift coefficients during the retraining
valued weights matrix into an 𝑁 times 𝐵 ranged bit-shift, process. DeepShift-Q [62] is trained with floating-point pa-
and encoding them with code-books 𝐂 as shown in Equa- rameters in backpropagation with values rounded to a suitable

T Liang et al.: Preprint submitted to Elsevier Page 17 of 41


Survey on pruning and quantization

format during inference. DeepShift-PS directly adopts the 𝑥𝑏 with a hard sigmoid probability 𝜎(𝑥). Both the activations
shift 𝑃 and sign 𝑆 parameters as trainable parameters. and the gradients use 32-bit single precision floating point.
Since logarithmic encoding has larger dynamic range, The trained BC network shows 1.18% classification error
redundant networks particularly benefit. However, less redun- on the small MNIST dataset but 8.27% classification on the
dant networks show significant accuracy loss. For example, larger CIFAR-10 dataset.
VGG-16 which is a redundant network shows 1.31% accuracy {
loss on top-1 while DenseNet-121 shows 4.02% loss. +1, with probability 𝑝 = 𝜎(𝑥)
𝑥𝑏 =
−1, with probability 1 − 𝑝 (25)
4.2.3. Plus-minus Quantization ( )
𝑥+1
Plus-minus quantization was in 1990 [208]. This tech- where 𝜎(𝑥) = clamp , 0, 1
2
nique reduces all weights to 1-bit representations. Similar
to logarithmic quantization, expensive multiplications are Courbariaux extended BC networks by binarizing the
removed. In this section, we provide an overview of signifi- activations. He named them BinaryNets [53], which is recog-
cant binarized network results. Simons [216] and Qin [198] nized as the first BNN. They also report a customized binary
provide an in-depth review of BNNs. matrix multiplication GPU kernel that accelerates the calcu-
Binarized neural networks (BNN) have only 1-bit weights lation by 7×. BNN is considered the first binarized neural
and often 1-bit activations. 0 and 1 are encoded to represent network where both weights and activations are quantized
-1 and +1, respectively. Convolutions can be separated into to binary values [216]. Considering the hardware cost of
multiplies and additions. In binary arithmetic, single bit stochastic binarization, they made a trade-off to apply deter-
operations can be performed using and, xnor, and bit-count. ministic binarization in most circumstances. BNN reported
We follow the introduction from [273] to explain bit-wise 0.86% error on MNIST, 2.53% error on SVHN, and 10.15%
operation. Single bit fixed point dot products are calculated error on CIFAR-10. The ILSVRC-2012 dataset accuracy
as in Equation 23, where and is a bit-wise AND operation results for binarized AlexNet and GoogleNet are 36.1% top-1
and bitcount counts the number of 1’s in the bit string. and 47.1%, respectively while the FP32 original networks
achieve 57% and 68%, respectively [112].
𝒙 ⋅ 𝒚 = bitcount(and(𝒙, 𝒚)), s.t. ∀𝑖, 𝑥𝑖 , 𝑦𝑖 ∈ {0, 1} (23) Rastegari [200] explored binary weight networks (BWN)
on the ILSVRC dataset with AlexNet and achieved the same
This can be extended into multi-bit computations as in Equa- classification accuracy as the single precision version. The
tion 24 [53]. 𝒙 and 𝒚 are M-bit and K-bit fixed point inte- key is a scaling factor 𝛼 ∈ ℝ+ applied to an entire layer of
∑ ∑𝐾−1
gers, subject to 𝒙 = 𝑀−1
𝑚=0 𝑐𝑚 (𝒙)2 and 𝒚 =
𝑚
𝑘=0 𝑐𝑘 (𝒚)2
𝑘 binarized weights 𝐁. This results in similar weights values
, where (𝑐𝑚 (𝒙))𝑀−1 and (𝑐𝑘 (𝒚))𝐾−1 are bit vectors. as if they were computed using FP32 𝐖 ≈ 𝛼𝐁. They also
𝑚=0 𝑘=0
applied weight binarization on ResNet-18 and GoogLeNet,
∑ 𝐾−1
𝑀−1 ∑ [ ( )] resulting in 9.5% and 5.8% top-1 accuracy loss compared
x⋅y = 2𝑚+𝑘 bitcount and 𝑐𝑚 (x), 𝑐𝑘 (y) , to the FP32 version, respectively. They also extended bina-
(24)
𝑚=0 𝑘=0 rization to activations called XNOR-Net and evaluated it on
s.t. 𝑐𝑚 (x)𝑖 , 𝑐𝑘 (y)𝑖 ∈ {0, 1}∀𝑖, 𝑚, 𝑘. the large ILSVRC-2012 dataset. Compared to BNN, XNOR-
Net also applied a scaling factor on the input feature and a
By removing complicated floating-point multiplications, rearrangement of the network structure (swapping the con-
networks are dramatically simplified with simple accumula- volution, activation, and BN). Finally, XNOR-Net achieved
tion hardware. Binarization not only reduces the network size 44.2% top-1 classification accuracy on ILSVRC-2012 with
by up-to 32×, but also drastically reduces memory usage re- AlexNet, while accelerating execution time 58× on CPUs.
sulting in significantly lower energy consumption [174, 112]. The attached scaling factor extended the binarized value ex-
However, reducing 32-bit parameters into a single bit results pression, which reduced the network distortion and lead to
in a significant loss of information, which decreases predic- better ImageNet accuracy.
tion accuracy. Most quantized binary networks significantly DoReFa-Net [272] also adopts plus-minus arithmetic for
under-perform compared to 32-bit competitors. quantized network. DoReFa additionally quantizes gradients
There are two primary methods to reduce floating-point to low-bit widths within 8-bit expressions during the back-
values into a single bit: 1) stochastic and 2) deterministic [52]. ward pass. The gradients are quantized stochastically in back
Stochastic methods consider global statistics or the value of propagation. For example, it takes 1 bit to represent weights
input data to determine the probability of some parameter to layer-wise, 2-bit activations, and 6-bits for gradients. We
be -1 or +1. Deterministic binarization directly computes describe training details in Section 4.2.5. They found 9.8%
the bit value based on a threshold, usually 0, resulting in a top-1 accuracy loss on AlexNet with ILSVRC-2012 using
sign function. Deterministic binarization is much simpler to the 1-2-6 combination. The result for the 1-4-32 combination
implement in hardware. is 2.9%.
Binary Connect (BC), proposed by Courbariaux [52], Li [146] and Leng [144] showed that for ternary weights
is an early stochastic approach to binarize neural networks. (−1, 0, and + 1), in Ternary Weight Networks (TWN), only
They binarized the weights both in forward and backward a slight accuracy loss was realized. Compared to BNN, TWN
propagation. Equation 25 shows the stochastic binarization has an additional value to reduce information loss while still

T Liang et al.: Preprint submitted to Elsevier Page 18 of 41


Survey on pruning and quantization

keeping computational complexity similar to BNN’s. Ternary measure problem


logic may be implemented very efficiently in hardware, as
𝑐𝑖𝑛
the additional value (zero) do not actually participate in com- ∑
𝑑 ∑
𝑑 ∑
𝑌 (𝑚, 𝑛, 𝑡) = 𝑆(𝐗(𝑚+𝑖, 𝑛+𝑗, 𝑘), 𝐅(𝑖, 𝑗, 𝑘, 𝑡)) (26)
putations [50]. TWN adopts the 𝑙2 -distance to find the scale
𝑖=0 𝑗=0 𝑘=0
and formats the weights into −1, 0, and + 1 with a threshold
generated by an assumption that the weighs are uniformly where 𝐅 ∈ ℝ𝑑×𝑑×𝑐𝑖𝑛 ×𝑐out is a filter, 𝑑 is the kernel size, 𝑐𝑖𝑛 is
distributed such as in [−𝑎, 𝑎]. This resulted in up to 16× an input channel and 𝑐out is an output channel. 𝐗 ∈ ℝℎ×𝑤×𝑐𝑖𝑛
model compression with 3.6% ResNet-18 top-1 accuracy loss stands for the input feature height ℎ and width 𝑤. With this
on ILSVRC-2012. formulation, the output 𝑌 is calculated with the similarity
Trained Ternary Quantization (TTQ) [274] extended TWN 𝑆(⋅, ⋅), i.e., 𝑆(𝑥, 𝑦) = 𝑥 × 𝑦 for conventional convolution
by introducing two dynamic constraints to adjust the quantiza- where the similarity measure is calculated by cross correla-
tion threshold. TTQ outperformed the full precision AlexNet tion. Equation 27 mathematically describes AdderNet, which
on the ILSVRC-2012 top-1 classification accuracy by 0.3%. replaces the multiply with subtraction. The 𝑙1 -distance is
It also outperformed TWN by 3%. applied to calculate the distance between the filter and the
Ternary Neural Networks (TNN) [6] extend TWN by input feature. By replacing multiplications with subtractions,
quantizing the activations into ternary values. A teacher net- AdderNet speeds up inference by transforming 3.9 billion
work is trained with full precision and then using transfer multiplications into subtractions with a loss in ResNet-50
learning the same structure is used but replacing the full accuracy of 1.3%.
precision values with a ternarized student in a layer-wise
greedy method. A small difference between the real-valued 𝑑 ∑
∑ 𝑐𝑖𝑛
𝑑 ∑
teacher network and the ternarized student network is that 𝑌 (𝑚, 𝑛, 𝑡) = − |𝐗(𝑚 + 𝑖, 𝑛 + 𝑗, 𝑘) − 𝐅(𝑖, 𝑗, 𝑘, 𝑡)|
they activate the output with a ternary output activation func- 𝑖=0 𝑗=0 𝑘=0
tion to simulate the real TNN output. TNN achieves 1.67% (27)
MNIST classification error and 12.11% classification error
on CIFAR10. TNN has slightly lower accuracy compared to NAS can be applied to BNN construction. Shen [213]
TWN (an additional 1.02% MNIST error). adopted evolutionary algorithms to find compact but accurate
Intel proposed Fine-Grained Quantization (FGQ) [170] models achieving 69.65% top-1 accuracy on ResNet-18 with
to generalize ternary weights by splitting them into several ImageNet at 2.8× speed-up. This is better performance than
groups and with independent ternary values. The FGQ quan- the 32-bit single precision baseline ResNet-18 accuracy of
tized ResNet-101 network achieved 73.85% top-1 accuracy on 69.6%. However, the search approach is time consuming
the ImageNet dataset (compared with 77.5% for the baseline) taking 1440 hours on an nVidia V100 GPU to search 50k
using four groups weights and without re-training. FGQ also ImageNet images to process an initial network.
showed improvements in (re)training demonstrating a top-1
accuracy improvement from 48% on non-trained to 71.1% 4.2.4. Other Approaches to Quantization
top-1 on ResNet-50. ResNet-50’s baseline accuracy is 75%. Weight sharing by vector quantization can also be consid-
Four groups FGQ with ternary weights and low bit-width ered a type of quantization. In order to compress parameters
activations achieves about 9× acceleration. to reduce memory space usage, parameters can be clustered
MeliusNet [21] is a binary neural network that consist and shared. K-means is a widely used clustering algorithm
of two types of binary blocks. To mitigate drawbacks of and has been successfully applied to DNNs with minimal
low bit width networks, reduced information quality, and loss of accuracy [76, 243, 143] achieving 16-24 times com-
reduced network capacity, MeliusNet used a combination pression with 1% accuracy loss on the ILSVRC-2012 dataset
of dense block [22] which increases network channels by [76, 243].
concatenating derived channels from the input to improve HashNet [37] uses a hash to cluster weights. Each hash
capacity and improvement block [161] which improves the group is replaced with a single floating-point weight value.
quality of features by adding additional convolutional acti- This was applied to FCLs and shallow CNN models. They
vations onto existing extra channels from dense block. They found a compression factor of 64× outperforms equivalent-
achieved accuracy results comparable to MobileNet on the sized networks on MNIST and seven other datasets they eval-
ImageNet dataset with MeliusNet-59 reporting 70.7% top- uated.
1 accuracy while requiring only 0.532 BFLOPs. A similar In 2016 Han applied Huffman coding with Deep Com-
sized 17MB MobileNet required 0.569 BFLOPs achieving pression [92]. The combination of weight sharing, pruning,
70.6% accuracy. and huffman coding achieved 49× compression on VGG-16
AdderNet [35] is another technique that replaces multiply with no loss of accuracy on ILSVRC-2012, which was SOTA
arithmetic but allows larger than 1-bit parameters. It replaces at the time.
all convolutions with addition. Equation 26 shows that for a The Hessian method was applied to measure the impor-
standard convolution, AdderNet formulates it as a similarity tance of network parameters and therefore improve weight
quantization [45]. They minimized the average Hessian
weighted quantization errors to cluster parameters. They
found compression ratios of 40.65 on AlexNet with 0.94%

T Liang et al.: Preprint submitted to Elsevier Page 19 of 41


Survey on pruning and quantization

accuracy loss on ILSVRC-2012. Weight regularization can plied for propagating gradients by using discretization [112].
slightly improve the accuracy of quantized networks by pe- Equation 28 show the STE for sign binarization, where 𝑐
nalizing weights with large magnitudes [215]. Experiments denotes the cost function, 𝑤𝑟 is the real-valued weights, and
showed that 𝑙2 regularization improved 8-bit quantized Mo- 𝑤𝑏 is the binarized weight produced by the sign function.
bileNet top-1 accuracy by 0.23% on ILSVRC-2012. STE bypasses the binarization function to directly calculate
BN has proved to have many advantages including ad- real-valued gradients. The floating-point weights are then up-
dressing the internal covariate shift issue [119]. It can also dated using methods like SGD. To avoid real-valued weights
be considered a type of quantization. However, quantization approaching infinity, BNNs typically clamp floating-point
performed with BN may have numerical instabilities. The weights to the desired range of ±1 [112].
BN layer has nonlinear square and square root operations. ( )
Low bit representations may be problematic when using non- Forward : 𝑤𝑏 = sign 𝑤𝑟
linear operations. To solve this, 𝑙1 -norm BN [245] has only 𝜕𝑐 𝜕𝑐 (28)
Backward : = 𝟏|𝑤𝑟 |≤1
linear operations in both forward and backward training. It 𝜕𝑤𝑟 𝜕𝑤𝑏
provided 1.5× speedup at half the power on FPGA platforms
and can be used with both training and inference. Unlike the forward phase where weights and activations
are produced with deterministic quantization, in the gradient
4.2.5. Quantization-aware Training phase, the low bit gradients should be generated by stochas-
Most quantization methods use a global (layer-wise) quan- tic quantization [89, 271]. DoReFa [272] first successfully
tization to reduce the full precision model into a reduced bit trained a network with gradient bit-widths less than eight and
model. Thus can result in non-negligible accuracy loss. A sig- achieved a comparable result with 𝑘-bit quantization arith-
nificant drawback of quantization is information loss caused metic. This low bit-width gradient scheme could accelerate
by the irreversible precision reducing transform. Accuracy training in edge devices with little impact to network accu-
loss is particularly visible in binary networks and shallow net- racy but minimal inference acceleration compared to BNNs.
works. Applying binary weights and activations to ResNet-34 DoReFa quantizes the weights, features, and gradients into
or GoogLeNet resulted in 29.10% and 24.20% accuracy loss, many levels obtaining a larger dynamic range than BNNs.
respectively [53]. It has been shown that backward propaga- They trained AlexNet on ImageNet from scratch with 1-bit
tion fine-tunes (retrains) a quantized network and can recover weights, 2-bit activations, and 6-bit gradients. They obtained
losses in accuracy caused by the quantization process [171]. 46.1% top-1 accuracy (9.8% loss comparing with the full
The retraining is even resilient to binarization information precision counterpart). Equation 29 shows the weight quan-
distortions. Thus training algorithms play a crucial role when tizing approach. 𝑤 is the weights (the same as in Equation 28),
using quantization. In this section, we introduce (re)training limit is a limit function applied to the weights keeping them
of quantized networks. in the range of [0, 1], and quantize𝑘 quantizes the weights
into 𝑘-levels. Feature quantization is performed using the
BNN Training: For a binarized network that has binary val- 𝑓𝛼𝑘 = quantize𝑘 function.
ued weights it is not effective to update the weights using ( )
gradient decent methods due to typically small derivatives. 𝑓𝑤𝑘 = 2 quantize𝑘 limit(𝑤𝑟 ) − 1
Early quantized networks were trained with a variation of 1 (( ) )
Bayesian inference named Expectation Back Propagation where quantize𝑘 (𝑤𝑟 ) = 𝑘 round 2𝑘 − 1 𝑤𝑟 , (29)
2 −1
(EBP) [220, 41]. This method assigns limited parameter pre- tanh(𝑥) 1
cision (e.g., binarized) weights and activations. EBP infers and limit(𝑥) = +
2 max(| tanh(𝑥)|) 2
networks with quantized weights by updating the posterior
distributions over the weights. The posterior distributions are In DoReFa, gradient quantization is shown in Equation 30,
updated by differentiating the parameters of the backpropa- where d𝑟 = 𝜕𝑐∕𝜕𝑟 is the backprogagated gradient of the cost
gation. function 𝑐 to output 𝑟.
BinaryConnect [52] adopted the probabilistic idea of
[ ( ) ]
EBP but instead of optimizing the weights posterior distri- d𝑟 1 1
bution, BC preserved floating-point weights for updates and 𝑓̃𝛾𝑘 = 2 max0 (|d𝑟|) quantize𝑘 + −
2 max0 (|d𝑟|) 2 2
then quantized them into binary values. The real-valued (30)
weights update using the back propagated error by simply
ignoring the binarization in the update. As in deep feed forward networks, the exploding gradi-
A binarized Network has only 1-bit parameters - ±1 quan- ent problem can cause BNN’s not to train. To address this
tized from a sign function. Single bit parameters are non- issue, Hou [104] formulated the binarization effect on the net-
differentiable and therefore it is not possible to calculate gra- work loss as an optimization problem which was solved by a
dients needed for parameter updating [208]. SGD algorithms proximal Newton’s algorithm with diagonal Hessian approx-
have been shown to need 6 to 8 bits to be effective [180]. To imation that directly minimizes the loss with respect to the
work around these limitations the Straight-Through Estima- binary weights. This optimization found 0.09% improvement
tor (STE), previously introduced by Hinton [102], was ap- on MNIST dataset compared with BNN.

T Liang et al.: Preprint submitted to Elsevier Page 20 of 41


Survey on pruning and quantization

Alpha-Blending (AB) [162] was proposed as a replace- input layer


ment for STE. Since STE directly sets the quantization func-
tion gradients to 1, a hypothesis was made that STE tuned weights feature
networks could suffer accuracy losses. Figure 14 shows that FP32 FP32
AB introduces an additional scale coefficient 𝛼. Real-valued

F32
weights and quantized weights are both kept. During training
𝛼 is gradually raised to 1 until a fully quantized network is optimizer quantizer quantizer
loss scaling float2half float2half
realized.

F16
weights feature
weights weights
float FP16 FP16
float
STE AB
quantizer
quantizer
1-α
weights feature
weights feature
binary float conv
binary float
α
+

F16
F16
conv conv
activation
FP16

(a) Straight-through Estimator (b) Alpha-Blending Approach

Figure 15: Mixed Precision Training [172]: FP16 is applied


Figure 14: STE and AB: STE directly bypasses the quan- in the forward and backward pass, while FP32 weights are
tizer while AB calculates gradients for real-valued weights by maintained for the update.
introducing additional coefficients 𝛼 [162]

In Section 4.3.2, we discuss deep learning libraries and frame-


Low Numerical Precision Training: Training with low works. We introduce their specification in Table 2 and then
numerical precision involves taking the low precision values compare their performance in Table 3. We also discuss hard-
into both forward and backward propagation while maintain- ware implementations of DNNs in Section 4.3.3. Dedicated
ing the full precision accumulated results. Mixed Precision hardware is designed or programmed to support efficient pro-
[172, 54] training uses FP16 or 16-bit integer (INT16) for cessing of quantized networks. Specialized CPU and GPU
weight precision. This has been shown to be inaccurate for operations are discussed. Finally, in Section 4.3.4 we discuss
gradient values. As shown in Figure 15, full precision weights DNN compilers.
are maintained for gradient updating, while other operands
use half-float. A loss scaling technique is applied to keep very 4.3.1. Deployment Introduction
small magnitude gradients from affecting the computation With significant resource capability, large organizations
since any value less than 2−24 becomes zero in half-precision and institutions usually have their own proprietary solutions
[172]. Specifically, a scaler is introduced to the loss value for applications and heterogeneous platforms. Their support
before backpropagation. Typically, the scaler is a bit-shift to the quantization is either inference only or as well as train-
optimal value 2𝑛 obtained empirically or by statistical infor- ing. The frameworks don’t always follow the same idea of
mation. quantization. Therefore there are differences between them,
In TensorFlow-Lite [120], training proceeds with real so performs.
values while quantization effects are simulated in the for- With DNNs being applied in many application areas, the
ward pass. Real-valued parameters are quantized to lower issue of efficient use of hardware has received considerable
precision before convolutional layers. BN layers are folded attention. Multicore processors and accelerators have been
into convolution layers. More details are described in Sec- developed to accelerate DNN processing. Many types of
tion 4.3.2. accelerators have been deployed, including CPUs with in-
As in binarized networks, STE can also be applied to struction enhancements, GPUs, FPGAs, and specialized AI
reduced precision training such as 8-bit integers [131]. accelerators. Often accelerators are incorporated as part of a
heterogeneous system. A Heterogeneous System Architec-
4.3. Quantization Deployment ture (HSA) allows the different processors to integrate into
In this section, we describe implementations of quanti- a system to simultaneously access shared memory. For ex-
zation deployed in popular frameworks and hardware. In ample, CPUs and GPUs using cache coherent shared virtual
Section 4.3.1 we give an introduction to deployment issues. memory on the same System of Chip (SoC) or connected by

T Liang et al.: Preprint submitted to Elsevier Page 21 of 41


Survey on pruning and quantization

Table 2
Low Precision Libraries Using Quantization: QAT is quantization-aware training, PTQ is
post-training quantization, and offset indicates the zero point 𝑧 in Equation 12.
Name Institution Core Lib Precision Method Platform Open-sourced
ARM CMSIS NN [129] Arm CMSIS 8-bit deploy only Arm Cortex-M Processor No
MACE [247] XiaoMi - 8-bit QAT and PTQ Mobile - CPU, Hexagon Chips, MTK APU Yes
MKL-DNN [204] Intel - 8-bit PTQ, mixed offset, and QAT Intel AVX Core Yes
NCNN [229] Tencent - 8-bit PTQ w/o offset Mobile Platform Yes
Paddle [13] Baidu - 8-bit QAT and PTQ w/o offset Mobile Platform Yes
QNNPACK [61] Fackbook - 8-bit PTQ w/ offset Mobile Platform Yes
Ristretto [90] LEPS gemm 3 method QAT Desktop Platform Yes
SNPE [228] Qualcomm - 16/8-bit PTQ w/ offset, max-min Snapdragon CPU, GPU, DSP No
Tensor-RT [173] nVidia - 8-bit PTQ w/o offset nVidia GPU Yes
TF-Lite [1] Google gemmlowp 8-bit PTQ w/ offset Mobile Platform Yes

Floating-point arithmetic units consume more energy and take longer to compute compared to integer arithmetic units. Consequently, low-bitwidth architectures are designed to accelerate computation [179]. Specialized algorithms and efficient hardware can accelerate neural network processing during both training and inference [202].

4.3.2. Efficient Kernels

Typically, low precision inference is only executed on convolutional layers. Intermediate values passed between layers use 32-bit floating-point. This makes many of the frameworks amenable to modifications.

Table 2 gives a list of major low precision acceleration frameworks and libraries. Most of them use INT8 precision. We will next describe some popular and open-source libraries in more detail.

Tensor RT [232, 242] is an nVidia developed C++ library that facilitates high-performance inference on NVIDIA GPUs. It is a low precision inference library that eliminates the bias term in convolutional layers. It requires a calibration set to adjust the quantization thresholds for each layer or channel. Afterwards the quantized parameters are represented by a 32-bit floating-point scale and INT8 weights.

Tensor RT takes a pre-trained floating-point model and generates a reusable optimized 8-bit integer or 16-bit half float model. The optimizer performs network profiling, layer fusion, memory management, and operation concurrency. Equation 31 shows the convolution-dequantization dataflow in Tensor RT for 8-bit integers. The intermediate result of the convolution of the INT8 input feature F_i8 with the weights W_i8 is accumulated into the INT32 tensor O_i32. It is dequantized by dividing by the feature and weight scales s_f, s_w.

    O_i32 = F_i8 ∗ W_i8,    O_f32 = O_i32 / (s_f × s_w)    (31)

Tensor RT applies a variant of max-abs quantization to reduce storage requirements and the calculation time of the zero point term z in Equation 15 by finding a proper threshold instead of using the absolute maximum value of the floating-point tensor. KL-divergence is introduced to make a trade-off between the numerical dynamic range and the precision of the INT8 representation [173]. KL calibration can significantly help to avoid accuracy loss.

The method traverses a predefined range of possible scales and calculates the KL-divergence for all the points. It then selects the scale which minimizes the KL-divergence. KL-divergence is widely used in many post-training acceleration frameworks. nVidia found a model calibrated with 125 images showed only 0.36% top-1 accuracy loss using GoogLeNet on the ImageNet dataset.

Intel MKL-DNN [204] is an optimized computing library for Intel processors with the Intel AVX-512, AVX-2, and SSE4.2 Instruction Set Architectures (ISA). The library uses FP32 for training and inference. Inference can also be performed using 8-bits in convolutional layers, ReLU activations, and pooling layers. It also uses Winograd convolutions. MKL-DNN uses the max-abs quantization shown in Equation 15, where the features adopt unsigned 8-bit integers (n_f = 256) and the weights signed 8-bit integers (n_w = 128). The rounding function f(.) in Equation 12 uses nearest integer rounding. Equation 32 shows the quantization applied to a given tensor or to each channel in a tensor. The maxima of the weights R_w and features R_f are calculated from the maximum absolute values of the tensors T_w and T_f. The feature scale s_f and weight scale s_w are generated using R_f and R_w. Then the quantized 8-bit signed integer weights W_s8, the 8-bit unsigned integer features F_u8, and the 32-bit signed integer biases B_s32 are generated using the scales and a nearest-integer rounding function ‖.‖.

    R_{f,w} = max(abs(T_{f,w}))
    s_f = 255 / R_f,    s_w = 127 / R_w
    W_s8 = ‖ s_w × W_f32 ‖ ∈ [-127, 127]    (32)
    F_u8 = ‖ s_f × F_f32 ‖ ∈ [0, 255]
    B_s32 = ‖ s_f × s_w × B_f32 ‖ ∈ [-2^31, 2^31 - 1]
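To make the dataflow of Equations 31 and 32 concrete (and anticipating the affine forms in Equations 33 and 34 below), the following numpy sketch quantizes features and weights, accumulates the integer product in a wide accumulator, and dequantizes with the product of the two scales. It uses a matrix product in place of a convolution and is an illustration of the scheme, not MKL-DNN or Tensor RT code:

```python
import numpy as np

rng = np.random.default_rng(1)
F_f32 = np.abs(rng.normal(size=(4, 64))).astype(np.float32)      # ReLU-activated features
W_f32 = rng.normal(scale=0.1, size=(64, 16)).astype(np.float32)
B_f32 = rng.normal(scale=0.1, size=(16,)).astype(np.float32)

# Scales from the maximum absolute values (Equation 32).
s_f = 255.0 / np.max(np.abs(F_f32))
s_w = 127.0 / np.max(np.abs(W_f32))

F_u8 = np.clip(np.round(s_f * F_f32), 0, 255).astype(np.uint8)
W_s8 = np.clip(np.round(s_w * W_f32), -127, 127).astype(np.int8)
B_s32 = np.round(s_f * s_w * B_f32).astype(np.int32)

# Integer multiply with 32-bit accumulation (the left-hand side of Equation 33).
O_s32 = F_u8.astype(np.int32) @ W_s8.astype(np.int32) + B_s32

# Dequantize by the product of the scales (Equations 31 and 34).
O_deq = O_s32.astype(np.float32) / (s_f * s_w)

O_ref = F_f32 @ W_f32 + B_f32
print("max abs error:", float(np.max(np.abs(O_deq - O_ref))))    # small; limited by rounding
```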


An affine transformation using 8-bit multipliers and 32-bit accumulators results in Equation 33, with the same scale factors as defined in Equation 32 and ∗ denoting convolution. It is an approximation since rounding is ignored.

    O_s32 = W_s8 ∗ F_u8 + b_s32
          ≈ s_f × s_w × (W_f32 ∗ F_f32 + b_f32)    (33)
          = s_f × s_w × O_f32

Equation 34 is the affine transformation in FP32 format, where D is the dequantization factor.

    O_f32 = W_f32 ∗ F_f32 + b_f32 ≈ (1 / (s_f × s_w)) × O_s32 = D × O_s32,
    where D = 1 / (s_f × s_w)    (34)

Weight quantization is done prior to inference. Activation quantization factors are prepared by sampling the validation dataset to find a suitable range (similar to Tensor RT). The quantization factors can either be kept in FP32 on the supported devices, or rounded to the nearest power-of-two format to enable bit-shifts. Rounding reduces accuracy by about 1%.

MKL-DNN assumes activations are non-negative (ReLU activated). Local Response Normalization (LRN), a function to pick the local maximum in a local distribution, is used to avoid over-fitting. BN, FCL, and soft-max using 8-bit inference are not currently supported.

TensorFlow-Lite (TF-Lite) [1] is an open source framework by Google for performing inference on mobile or embedded devices. It consists of two sets of tools for converting and interpreting quantized networks. Both PTQ and QAT are available in TF-Lite.

GEMM low-precision (Gemmlowp) [78] is a Google open source gemm library for low precision calculations on mobile and embedded devices. It is used in TF-Lite. Gemmlowp uses asymmetric quantization as shown in Equation 35, where F, W, O denote the feature, weights, and output, respectively. s_f, s_w are the scales for the features and weights, respectively. F_f32 is the feature value in 32-bit floating point. Similarly, W_f32 is the weight value in 32-bit floating point. F_q, W_q are the quantized features and weights, respectively. Asymmetric quantization introduces the zero points (z_f and z_w). This produces a more accurate numerical encoding.

    O_f32 = F_f32 ∗ W_f32
          = (s_f × (F_q + z_f)) ∗ (s_w × (W_q + z_w))    (35)
          = s_f × s_w × (F_q + z_f) ∗ (W_q + z_w)

The (F_q + z_f) ∗ (W_q + z_w) term in Equation 35 is the most computationally intensive part. In addition to the convolution, the zero points also require calculation. Gemmlowp reduces many multiply-add operations by multiplying with all-ones matrices as the bias matrices P and Q in Equation 36. This allows four multiplies to be dispatched in a three stage pipeline [131] to produce the quantized output O_q. F, W, z are the same as in Equation 35.

    O_q = (F_q + z_f × P) ∗ (W_q + z_w × Q)
        = F_q ∗ W_q + z_f × P × W_q + z_w × Q × F_q + z_f × z_w × P × Q    (36)

Ristretto [90] is a tool for Caffe quantization. It uses retraining to adjust the quantized parameters. Ristretto uses a three-part quantization strategy: 1) a modified fixed-point format, Dynamic Fixed Point (DFP), which permits the limited bit-width precision to dynamically carry data, 2) bit-width reduced floating-point numbers called mini float which follow the IEEE-754 standard [219], and 3) integer power-of-2 weights that force parameters into power-of-2 values to replace multiplies with bit shift operations.

DFP is shown in Equation 37, where s takes one sign bit, FL denotes the fractional length, and x is the mantissa. The total bit-width is B. This quantization can encode data from various ranges into a proper format by adjusting the fractional length.

    (-1)^s × 2^(-FL) × Σ_{i=0}^{B-2} 2^i × x_i    (37)

A bit shift convolution conversion is shown in Equation 38. The convolution of the inputs F_j with the weights W_j and bias b_i is transformed into shift arithmetic by rounding the weights to the nearest power-of-2 values. Power-of-2 weights provide inference acceleration while dynamic fixed point provides better accuracy.

    O_i = Σ_j [ F_j ⋅ W_j ] + b_i
        ≈ Σ_j [ F_j ≪ round(log2(W_j)) ] + b_i    (38)

NCNN [229] is a standalone framework from Tencent for efficient inference on mobile devices. Inspired by Ristretto and Tensor-RT, it works with multiple operating systems and supports low precision inference [28]. It performs channel-wise quantization with KL calibration. The quantization results in 0.04% top-1 accuracy loss on ILSVRC-2012. NCNN has implementations optimized for ARM NEON. NCNN also replaces 3 × 3 convolutions with simpler Winograd convolutions [135].
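A simplified sketch of the KL-divergence threshold search behind Tensor RT and NCNN style calibration is given below. The bin counts, the coarse search grid, and the helper names are our own choices, and production implementations refine this procedure considerably:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p, q = p / p.sum(), q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def kl_calibrate(activations, num_bins=2048, num_levels=128):
    a = np.abs(np.asarray(activations).ravel())
    hist, edges = np.histogram(a, bins=num_bins, range=(0.0, float(a.max())))
    best_t, best_kl = float(a.max()), np.inf
    for i in range(num_levels, num_bins + 1, 16):    # candidate clipping points
        t = edges[i]
        p = hist[:i].astype(np.float64).copy()
        p[-1] += hist[i:].sum()                      # fold clipped outliers into the last bin
        # Re-express the clipped distribution with only num_levels bins, then expand it back.
        group = i // num_levels
        q = np.zeros_like(p)
        for j in range(num_levels):
            lo = j * group
            hi = i if j == num_levels - 1 else (j + 1) * group
            chunk = p[lo:hi]
            nonzero = chunk > 0
            if nonzero.any():
                q[lo:hi][nonzero] = chunk.sum() / nonzero.sum()
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, t
    return best_t                                    # an INT8 scale would be 127 / best_t

acts = 0.5 * np.random.default_rng(2).standard_normal(100_000)
print("calibrated clipping threshold:", kl_calibrate(acts))
```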


Mobile AI Compute Engine (MACE) [247] from Xiaomi supports both post-training quantization and quantization-aware training. Quantization-aware training is recommended as it exhibits lower accuracy loss. Post-training quantization requires statistical information from activations collected while performing inference. This is typically performed with batch calibration of input data. MACE also supports processor implementations optimized for ARM NEON and Qualcomm's Hexagon digital signal processor. OpenCL acceleration is also supported. Winograd convolutions can be applied for further acceleration as discussed in Section 4.2.2.

Quantized Neural Network PACKage (QNNPACK) [61] is a Facebook produced open-source library optimized for edge computing, especially for mobile low precision neural network inference. It has the same method of quantization as TF-Lite, including using a zero-point. The library has been integrated into PyTorch [193] to provide users a high-level interface. In addition to Winograd and FFT convolution operations, the library has a gemm optimized for cache indexing and feature packing. QNNPACK has a fully compiled solution for many mobile devices and has been deployed on millions of devices with Facebook applications.

Panel Dot product (PDOT) is a key feature of QNNPACK's highly efficient gemm library. It assumes computing efficiency is limited by memory, cache, and bandwidth rather than by Multiply and Accumulate (MAC) performance. PDOT computes multiple dot products in parallel as shown in Figure 16. Rather than loading just two operands per MAC operation, PDOT loads multiple columns and rows. This improves convolution performance, with about a 1.41× to 2.23× speedup for MobileNet on mobile devices [61].

Figure 16: PDOT: computing dot products for several points in parallel.

Paddle [13] applies both QAT and PTQ quantization using zero-points. The dequantization operation can be performed prior to convolution as shown in Equation 39. Paddle uses this feature to do floating-point gemm-based convolutions with quantize-dequantized weights and features within the framework data-path. It introduces quantization error while maintaining the data in floating-point format. This quantize-dequantize-convolution pipeline is called simu-quantize and its results are approximately equal to an FP32->INT8->Convolution->FP32 (quantize - convolution - dequantize) three stage model.

    O_f32 = (F_q / (n - 1) × F_max) ∗ (W_q / (n - 1) × W_max)    (39)

Simu-quantize maintains the data at each phase in 32-bit floating-point, facilitating backward propagation. In the Paddle framework, during backpropagation, gradients are added to the original 32-bit floating-point weights rather than to the quantized or the quantize-dequantized weights.

Paddle uses max-abs in three ways to quantize parameters: 1) the average of the max absolute value in a calculation window, 2) the max absolute value during a calculation window, and 3) a sliding average of the max absolute value of the window. The third method is described in Equation 40, where V is the max absolute value in the current batch, V_t is the average value of the sliding window, and k is a coefficient chosen by default as 0.9.

    V_t = (1 - k) × V + k × V_{t-1}    (40)

The Paddle framework uses a specialized toolset, PaddleSlim, which supports quantization, pruning, Network Architecture Search, and knowledge distillation. They found an 86.47% size reduction of ResNet-50, with 1.71% ILSVRC-2012 top-1 accuracy loss.
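A minimal numpy sketch of the quantize-dequantize ("simu-quantize") forward pass together with the sliding-average range tracking of Equations 39 and 40 is shown below. It mimics the idea used by Paddle and TF-Lite quantization-aware training rather than reproducing either implementation; the class and variable names are ours:

```python
import numpy as np

class FakeQuant:
    """Simulated quantization: data stays FP32 in the data path, but the forward
    pass rounds it to the INT8 grid using a range tracked as in Equation 40."""

    def __init__(self, num_bits=8, k=0.9):
        self.levels = 2 ** (num_bits - 1) - 1        # 127
        self.k = k
        self.v_t = None                              # sliding average of the max-abs value

    def __call__(self, x):
        v = float(np.max(np.abs(x)))                 # V: max-abs of the current batch
        self.v_t = v if self.v_t is None else (1.0 - self.k) * v + self.k * self.v_t
        scale = self.v_t / self.levels
        q = np.clip(np.round(x / scale), -self.levels, self.levels)
        return q * scale                             # dequantized FP32 carrying INT8 error

# One layer in the quantize-dequantize-convolution style (a matmul stands in for conv).
fq_w, fq_a = FakeQuant(), FakeQuant()
rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(64, 16))
for step in range(5):
    x = rng.normal(size=(32, 64))
    y = fq_a(x) @ fq_w(W)    # gradients would be applied to the FP32 W, not the rounded copy
print("tracked activation range:", fq_a.v_t)
```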


Table 3
Low Precision Libraries versus Accuracy for Common Networks in Multiple Frameworks.
Name Framework Method Float-Top-1 Float-Top-5 Quant-Top-1 Quant-Top-5 Diff-Top-1 Diff-Top-5
AlexNet TensorRT [173] PTQ, w/o offset 57.08% 80.06% 57.05% 80.06% -0.03% 0.00%
Ristretto [90] Dynamic FP 56.90% 80.09% 56.14% 79.50% -0.76% -0.59%
Ristretto [90] Minifloat 56.90% 80.09% 52.26% 78.23% -4.64% -1.86%
Ristretto [90] Pow-of-two 56.90% 80.09% 53.57% 78.25% -3.33% -1.84%
GoogleNet NCNN [28] PTQ, w/o offset 68.50% 88.84% 68.62% 88.68% 0.12% -0.16%
TensorRT [173] PTQ, w/o offset 68.57% 88.83% 68.12% 88.64% -0.45% -0.19%
Ristretto [90] Dynamic FP 68.93% 89.16% 68.37% 88.63% -0.56% -0.53%
Ristretto [90] Minifloat 68.93% 89.16% 64.02% 87.69% -4.91% -1.47%
Ristretto [90] Pow-of-two 68.93% 89.16% 57.63% 81.38% -11.30% -7.78%
Inception v3 TF-Lite [77] PTQ 78.00% 93.80% 77.20% - -0.80% -
TF-Lite [77] QAT 78.00% 93.80% 77.50% 93.70% -0.50% -0.10%
MobileNet v1 NCNN [28] PTQ, w/o offset 67.26% 87.92% 66.74% 87.43% -0.52% -0.49%
Paddle [13] QAT+Pruning 70.91% - 69.20% - -1.71% -
TF-Lite [77] PTQ 70.90% - 65.70% - -5.20% -
TF-Lite [77] QAT 70.90% - 70.00% - -0.90% -
MobileNet v2 QNNPACK [61] PTQ, w/ offset 71.90% - 72.14% - 0.24% -
TF-Lite [77] PTQ 71.90% - 63.70% - -8.20% -
TF-Lite [77] QAT 71.90% - 70.90% - -1.00% -
ResNet-101 TensorRT [173] PTQ, w/o offset 74.39% 91.78% 74.40% 91.73% 0.01% -0.05%
TF-Lite [77] PTQ 77.00% - 76.80% - -0.20% -
ResNet-152 TensorRT [173] PTQ, w/o offset 74.78% 91.82% 74.70% 91.78% -0.08% -0.04%
ResNet-18 NCNN [28] PTQ, w/o offset 65.49% 86.56% 65.30% 86.52% -0.19% -0.04%
ResNet-50 NCNN [28] PTQ, w/o offset 71.80% 89.90% 71.76% 90.06% -0.04% 0.16%
TensorRT [173] PTQ, w/o offset 73.23% 91.18% 73.10% 91.06% -0.13% -0.12%
SqueezeNet NCNN [28] PTQ, w/o offset 57.78% 79.88% 57.82% 79.84% 0.04% -0.04%
Ristretto [90] Dynamic FP 57.68% 80.37% 57.21% 79.99% -0.47% -0.38%
Ristretto [90] Minifloat 57.68% 80.37% 54.80% 78.28% -2.88% -2.09%
Ristretto [90] Pow-of-two 57.68% 80.37% 41.60% 67.37% -16.08% -13.00%
VGG-19 TensorRT [173] PTQ, w/o offset 68.41% 88.78% 68.38% 88.70% -0.03% -0.08%

4.3.3. Hardware Platforms

Figure 17 shows AI chips, cards, and systems plotted by peak operations versus power in log scale, originally published in [202]. Three normalizing lines are shown at 100 GOPS/Watt, 1 TOP/Watt, and 10 TOPs/Watt. Hardware platforms are classified along several dimensions including: 1) training or inference, 2) chip, card, or system form factors, 3) datacenter or mobile, and 4) numerical precision. We focus on low precision general and specialized hardware in this section.

Programmable Hardware: Quantized networks with less than 8-bits of precision are typically implemented in FPGAs but may also be executed on general purpose processors. BNNs have been implemented on a Xilinx Zynq heterogeneous FPGA platform [267]. They have also been implemented on Intel Xeon CPUs and Intel Arria 10 FPGA heterogeneous platforms by dispatching bit operations to FPGAs and other operations to CPUs [178]. The heterogeneous system shares the same memory address space. Training is typically mapped to CPUs. FINN [231] is a specialized framework for BNN inference on FPGAs. It contains binarized fully connected, convolutional, and pooling layers. When deployed on a Zynq-7000 SoC, FINN has achieved 12.36 million images per second on the MNIST dataset with 4.17% accuracy loss.

Binarized weights with 3-bit features have been implemented on Xilinx Zynq FPGAs and Arm NEON processors [196]. The first and last layers of the network use 8-bit quantities but all other layers use binary weights and 3-bit activation values. On an embedded platform, the Zynq XCZU3EG, they performed 16 images per second for inference. To accelerate Tiny-YOLO inference, significant efforts were taken including: 1) replacing max-pool with stride 2 convolution, 2) replacing leaky ReLU with ReLU, and 3) revising the hidden layer output channels. These changes improved efficiency on the FPGA from 2.5 to 5 frames per second with 1.3% accuracy loss.

TNN [6] is deployed on an FPGA with specialized computation units optimized for ternary value multiplication. A specific FPGA structure (dimensions) is determined during synthesis to improve hardware efficiency. On the Sakura-X FPGA board they achieved 255k MNIST image classifications per second with an accuracy of 98.14%. A scalable design implemented on a Xilinx Virtex-7 VC709 board dramatically reduced hardware resources and power consumption but at a significantly reduced throughput of 27k CIFAR-10 images per second [197]. Power consumption for CIFAR-10 was 6.8 Watts.

Reducing hardware costs is a key objective of logarithmic hardware. Xu [249] adopted √2-based logarithmic quantization with 5-bits of resolution. This showed 50.8% top-1 accuracy and dissipated a quarter of the power while using half the chip area. Half precision inference has a top-1 accuracy of 53.8%.

General Hardware: In addition to specialized hardware, INT8 quantization has been widely adopted in many general purpose processor architectures. In this section we provide a high-level overview. A detailed survey on hardware efficiency for processing DNNs can be found in [202].

CNN acceleration on ARM CPUs was originally implemented using the ARM advanced SIMD extensions known as NEON. The ARM 8.2 ISA extension added NEON support for 8-bit integer matrix operations [8]. These were implemented in the CPU IP cores Cortex-A75 and A55 [9] as well as the Mali-G76 GPU IP core [10]. These cores have been integrated into the Kirin SoC by Huawei, the Qualcomm Snapdragon SoC, the MediaTek Helio SoC, and the Samsung Exynos [116]. For example, on the Exynos 9825 Octa, an 8-bit integer quantized MobileNet v2 can process an image in 19 ms (52 images per second) using the Mali-G76 [116].

Intel improved integer performance by about 33% with the Intel Advanced Vector Extension 512 (AVX-512) ISA [204]. This 512-bit SIMD ISA extension included a Fused Multiply-Add (FMA) instruction.
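Conceptually, the 8-bit integer SIMD support described above reduces to widening multiply-accumulate: small groups of int8 products are summed into int32 lanes, which is why packing four 8-bit operands per 32-bit lane yields roughly four times more multiply-accumulates per register than FP32. The numpy sketch below illustrates that per-lane computation in the spirit of the ARM dot-product and AVX-512 integer instructions; it is illustrative, not vendor intrinsics code:

```python
import numpy as np

def widening_dot(a_u8, b_s8, group=4):
    # Each output element accumulates `group` int8 x int8 products into an int32 lane.
    a = a_u8.astype(np.int32).reshape(-1, group)
    b = b_s8.astype(np.int32).reshape(-1, group)
    return (a * b).sum(axis=1)

rng = np.random.default_rng(4)
a = rng.integers(0, 256, size=64, dtype=np.uint8)
b = rng.integers(-128, 128, size=64, dtype=np.int8)
partial = widening_dot(a, b)                         # 16 int32 partial sums
full = int(a.astype(np.int64) @ b.astype(np.int64))
print(int(partial.sum()) == full)                    # True: same dot product, wider lanes
```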



Figure 17: Neural network processing performance of hardware platforms (peak GOps/second versus peak power, with 100 GOps/W, 1 TOps/W, and 10 TOps/W guide lines; markers distinguish computation precision, form factor, chip/card/system category, and inference versus training), adopted from [202]. Slide courtesy of Albert Reuther, MIT Lincoln Laboratory Supercomputing Center.

Low precision computation on nVidia GPUs has been enabled since the Pascal series of GPUs [184]. The Turing GPU architecture [188] introduced specialized units to process INT4 and INT8. This provides real-time integer performance for AI algorithms used in games. For embedded platforms, nVidia developed the Jetson platforms [187]. They use CUDA Maxwell cores [183] that can process half-precision types. For the data center, nVidia developed the extremely high performance DGX system [185]. It contains multiple high-end GPUs interconnected using nVidia's proprietary bus nVLINK. A DGX system can perform 4-bit integer to 32-bit floating point operations.

4.3.4. DNN Compilers

Heterogeneous neural network hardware accelerators are accelerating deep learning algorithm deployment [202]. Often exchange formats can be used to import/export models. Further, compilers have been developed to optimize models and generate code for specific processors. However, several challenges remain:

• Network Parsing: Developers design neural network models on different platforms using various frameworks and programming languages. However, they have common parts, such as convolution, activation, pooling, etc. Parsing tools analyze the model compositions and translate them into a unified representation.

• Structure Optimization: The model may contain operations used in training that aren't required for inference. Tool-kits and compilers should optimize these structures (e.g. BN folding as discussed in Section 2.5 and sketched below).

• Intermediate Representation (IR): An optimized model should be properly stored for further deployment. Since the inference engine is uncertain, the stored IR should include the model architecture and the trained weights. A compiler can then read the model and optimize it for a specific inference engine.

• Compression: Compilers and optimizers should optionally be able to automatically compress arbitrary network structures using pruning and quantization.

• Deployment: The final optimized model should be mapped to the target engine(s), which may be heterogeneous.
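As an illustration of the structure-optimization step in the list above, BN folding can be written in a few lines. The following framework-agnostic numpy sketch uses our own parameter names and shapes and assumes an inference-time BN with stored statistics:

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a per-channel BatchNorm into the preceding convolution.
    W: (out_ch, in_ch, kh, kw), b: (out_ch,); BN parameters are per out_ch."""
    std = np.sqrt(var + eps)
    W_folded = W * (gamma / std)[:, None, None, None]
    b_folded = (b - mean) * gamma / std + beta
    return W_folded, b_folded

rng = np.random.default_rng(5)
out_ch, in_ch = 8, 3
W = rng.normal(size=(out_ch, in_ch, 3, 3))
b = rng.normal(size=out_ch)
gamma, beta = rng.normal(size=out_ch), rng.normal(size=out_ch)
mean, var = rng.normal(size=out_ch), rng.uniform(0.5, 2.0, size=out_ch)

W_folded, b_folded = fold_bn(W, b, gamma, beta, mean, var)
# For any input x: BN(conv(x, W) + b) == conv(x, W_folded) + b_folded,
# so the folded layer can be quantized and deployed as a single convolution.
```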


Open Neural Network Exchange (ONNX) [190] is an open-source tool to parse AI models written for a variety of diverse frameworks. It imports and exports models using an open-source format, facilitating the translation of neural network models between frameworks. It is thus capable of network parsing provided low-level operations are defined in all target frameworks.

TVM [36], Glow [205], OpenVINO [118], and MLIR [134] are deep learning compilers. They differ from frameworks such as Caffe in that they store intermediate representations and optimize those to map models onto specific hardware engines. They typically integrate both quantization-aware training and calibration-based post-training quantization. We summarize key features below. They perform all the operations noted in our list. A detailed survey can be found in [149].

TVM [36] leverages the efficiency of quantization by enabling deployment of quantized models from PyTorch and TF-Lite. As a compiler, TVM has the ability to map the model onto general hardware such as Intel's AVX and nVidia's CUDA.

Glow [205] enables quantization with zero points and converts the data into 8-bit signed integers using a calibration-based method. Neither Glow nor TVM currently supports quantization-aware training, although they have both announced future support for it [205].

MLIR [134] and OpenVINO [118] have sophisticated quantization support including quantization-aware training. OpenVINO integrates it in TensorFlow and PyTorch while MLIR natively supports quantization-aware training. This allows users to fine-tune an optimized model when it doesn't satisfy accuracy criteria.

4.4. Quantization Reduces Over-fitting

In addition to accelerating neural networks, quantization has also been found in some cases to result in higher accuracy. As examples: 1) 3-bit weight VGG-16 outperforms its full precision counterpart by 1.1% top-1 [144], 2) AlexNet with 2-bit weights and 8-bit activations reduces the top-1 error of the reference by 1.0% [66], 3) ResNet-34 with 4-bit weights and activations obtained 74.52% top-1 accuracy while the 32-bit version is 73.59% [174], 4) Zhou showed a quantized model reduced the classification error by 0.15%, 2.28%, 0.13%, 0.71%, and 1.59% on AlexNet, VGG-16, GoogLeNet, ResNet-18 and ResNet-50, respectively [269], and 5) Xu showed reduced bit quantized networks help to reduce over-fitting on Fully Connected Networks (FCNs). By taking advantage of strict constraints in biomedical image segmentation they improved segmentation accuracy by 1% combined with a 6.4× memory usage reduction [251].

5. Summary

In this section we summarize the results of pruning and quantization.

5.1. Pruning

Section 3 shows pruning is an important technique for compressing neural networks. In this paper, we discussed pruning techniques categorized as 1) static pruning and 2) dynamic pruning. Previously, static pruning was the dominant area of research. Recently, dynamic pruning has become a focus because it can further improve performance even if static pruning has first been performed.

Pruning can be performed in multiple ways. Element-wise pruning improves weight compression and storage. Channel-wise and shape-wise pruning can be accelerated with specialized hardware and computation libraries. Filter-wise and layer-wise pruning can dramatically reduce computational complexity.

Though pruning sometimes introduces an incremental improvement in accuracy by escaping a local minimum [12], accuracy improvements are better realized by switching to a better network architecture [24]. For example, a separable block may provide better accuracy with reduced computational complexity [105]. Considering the evolution of network structures, performance may also be bottlenecked by the structure itself. From this point of view, Network Architecture Search and Knowledge Distillation can be options for further compression. Network pruning can be viewed as a subset of NAS but with a smaller searching space. This is especially true when the pruned architecture no longer needs to use weights from the unpruned network (see Section 3.3). In addition, some NAS techniques can also be applied to the pruning approach, including borrowing trained coefficients and reinforcement learning search.

Typically, compression is evaluated on large datasets such as the ILSVRC-2012 dataset with one thousand object categories. In practice, resource constraints in embedded devices don't allow a large capacity of optimized networks. Compressing a model to best fit a constrained environment should consider, but not be limited to, the deployment environment, target device, speed/compression trade-offs, and accuracy requirements [29].

Based on the reviewed pruning techniques, we recommend the following for effective pruning:

• Uniform pruning introduces accuracy loss, therefore setting the pruning ratio to vary by layer is better [159].

• Dynamic pruning may result in higher accuracy and maintain higher network capacity [246].

• Structurally pruning a network may benefit from maturing libraries, especially when pruning at a high level [241].

• Training a pruned model from scratch sometimes, but not always (see Section 3.3), is more efficient than tuning from the unpruned weights [160].

• Penalty-based pruning typically reduces accuracy loss compared with magnitude-based pruning [255]. However, recent efforts are narrowing the gap [72].

5.2. Quantization

Section 4 discusses quantization techniques. It describes binarized quantized neural networks and reduced precision networks, along with their training methods. We described low-bit dataset validation techniques and results. We also list the accuracy of popular quantization frameworks and described hardware implementations in Section 4.3.

Quantization usually results in a loss of accuracy due to information lost during the quantization process. This is particularly evident on compact networks. Most of the early low bit quantization approaches only compare performance on small datasets (e.g., MNIST and CIFAR-10) [58, 94, 156, 200, 235, 269]. However, observations showed that some quantized networks could outperform the original network (see Section 4.4). Additionally, non-uniformly distributed data may lead to further deterioration in quantization performance [275]. Sometimes this can be ameliorated by normalization in fine-tuning [172] or by non-linear quantization (e.g., log representation) [175].

Advanced quantization techniques have improved accuracy. Asymmetric quantization [120] maintains a higher dynamic range by using a zero point in addition to a regular scale parameter.


Overheads introduced by the zero point were minimized by pipelining the processing unit. Calibration based quantization [173] removed zero points and replaced them with precise scales obtained from a calibration dataset. Quantization-aware training was shown to further improve quantization accuracy.

8-bit quantization is widely applied in practice as a good trade-off between accuracy and compression. It can easily be deployed on current processors and custom hardware. Minimal accuracy loss is experienced, especially when quantization-aware training is enabled. Binarized networks have also achieved reasonable accuracy with specialized hardware designs.

Though BN has advantages to help training and pruning, an issue with BN is that it may require a large dynamic range across a single layer kernel or between different channels. This may make layer-wise quantization more difficult. Because of this, per channel quantization is recommended [131].

To achieve better accuracy following quantization, we recommend:

• Use asymmetrical quantization. It preserves flexibility over the quantization range even though it has computational overheads [120].

• Quantize the weights rather than the activations. Activations are more sensitive to numerical precision [75].

• Do not quantize biases. They do not require significant storage. High precision biases in all layers [114], and in the first/last layers [200, 272], maintain higher network accuracy.

• Quantize kernels channel-wise instead of layer-wise to significantly improve accuracy [131].

• Fine-tune the quantized model. It reduces the accuracy gap between the quantized model and the real-valued model [244].

• Initially train using a 32-bit floating point model. Low-bit quantized models can be difficult to train from scratch, especially compact models on large-scale datasets [272].

• The sensitivity of quantization is ordered as gradients, activations, and then weights [272].

• Stochastic quantization of gradients is necessary when training quantized models [89, 272].

6. Future Work

Although pruning and quantization algorithms help reduce the computation cost and bandwidth burden, there are still areas for improvement. In this section we highlight future work to further improve quantization and pruning.

Automatic Compression. Low bit width quantization can cause significant accuracy loss, especially when the quantized bit-width is very narrow and the dataset is large [272, 155]. Automatic quantization is a technique to automatically search quantization encodings to evaluate accuracy loss versus compression ratio. Similarly, automatic pruning is a technique to automatically search different pruning approaches to evaluate the sparsity ratio versus accuracy. Similar to hyperparameter tuning [257], this can be performed without human intervention using any number of search techniques (e.g. random search, genetic search, etc.).

Compression on Other Types of Neural Networks. Current compression research is primarily focused on CNNs. More specifically, research is primarily directed towards CNN classification tasks. Future work should also consider other types of applications such as object detection, speech recognition, language translation, etc. Network compression versus accuracy for different applications is an interesting area of research.

Hardware Adaptation. Hardware implementations may limit the effectiveness of pruning algorithms. For example, element-wise pruning only slightly reduces computations or bandwidth when using im2col-gemm on GPUs [264]. Similarly, shape-wise pruning is not typically able to be implemented on dedicated CNN accelerators. Hardware-software co-design of compression techniques for hardware accelerators should be considered to achieve the best system efficiency.

Global Methods. Network optimizations are typically applied separately, without information from one optimization informing any other optimization. Recently, approaches that consider optimization effectiveness at multiple layers have been proposed. [150] discusses pruning combined with tensor factorization that results in better overall compression. Similar techniques can be considered using different types and levels of compression and factorization.

7. Conclusions

Deep neural networks have been applied in many applications exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy. In some cases accuracy may even improve.

Pruning can be categorized as static (Section 3.1) if it is performed offline or dynamic (Section 3.2) if it is performed at run-time. The criteria applied to removing redundant computations is often just a simple magnitude of weights, with values near zero being pruned. More complicated methods include checking the l_p-norm. Techniques such as LASSO and Ridge are built around the l_1 and l_2 norms. Pruning can be performed element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise. Each has trade-offs in compression, accuracy, and speedup.

Quantization reduces computations by reducing the precision of the datatype. Most networks are trained using 32-bit floating point.


Weights, biases, and activations may then be quantized, typically to 8-bit integers. Lower bit width quantizations have been performed with single bit being termed
a binary neural network. It is difficult to (re)train very low
bit width neural networks. A single bit is not differentiable
thereby prohibiting back propagation. Lower bit widths cause
difficulties for computing gradients. The advantage of quan-
tization is significantly improved performance (usually 2-3x)
and dramatically reduced storage requirements. In addition
to describing how quantization is performed we also included
an overview of popular libraries and frameworks that support
quantization. We further provided a comparison of accuracy
for a number of networks using different frameworks in Table 3.
In this paper, we summarized pruning and quantization
techniques. Pruning removes redundant computations that
have limited contribution to a result. Quantization reduces
computations by reducing the precision of the datatype. Both
can be used independently or in combination to reduce storage
requirements and accelerate inference.


8. Quantization Performance Results

Table 4: Quantization Network Performance on ILSVRC2012 for various bit-widths of the weights W and activation A (aka. feature). Columns: Model, Deployment, W, A, Top-1 accuracy drop, Top-5 accuracy drop, Reference. Rows without a model name continue the model group listed above them.

Model Deployment W A Top-1 Top-5 Ref.
ShiftCNN 1 4 11.26% 7.36% [84]
LogQuant 32 3 13.50% 8.93% [30]
LogQuant 3 3 18.07% 12.85% [30]
LogQuant 4 4 18.57% 13.21% [30]
LogQuant 32 4 18.57% 13.21% [30]
BNN 1 1 24.20% 20.90% [53]
AngleEye 6 6 52.10% 57.35% [85]
MobileNet V1 HAQ-Cloud 6 6 -0.38% -0.23% [236]
HAQ-Edge 6 6 -0.38% -0.34% [236]
AlexNet QuantNet 1 32 -1.70% -1.50% [253] MelinusNet59 1 1 -0.10% - [21]
BWNH 1 32 -1.40% -0.70% [107] HAQ-Edge 5 5 0.24% 0.08% [236]
SYQ 2 8 -1.00% -0.60% [66] PACT 6 6 0.36% 0.26% [44]
TSQ 2 2 -0.90% -0.30% [239] PACT 6 6 0.36% 0.26% [44]
INQ 5 32 -0.87% -1.39% [269] HAQ-Cloud 5 5 0.85% 0.48% [236]
PACT 4 3 -0.60% -1.00% [44] HAQ-Edge 4 4 3.42% 1.95% [236]
QIL 4 4 -0.20% - [127] PACT 5 5 3.82% 2.20% [44]
Mixed-Precision 16 16 -0.16% - [172] PACT 5 5 3.82% 2.20% [44]
PACT 32 5 -0.10% -0.20% [44] HAQ-Cloud 4 4 5.49% 3.25% [236]
QIL 5 5 -0.10% - [127] PACT 4 4 8.38% 5.66% [44]
QuantNet 3(±4) 32 -0.10% -0.10% [253] MobileNet HAQ-Edge 6 6 -0.08% -0.11% [236]
ELNN 3(±4) 32 0.00% 0.20% [144] V2 HAQ-Cloud 6 6 -0.04% 0.01% [236]
DoReFa-Net 32 3 0.00% -0.90% [272] Unified INT8 8 8 0.00% - [275]
TensorRT 8 8 0.03% 0.00% [173] PACT 6 6 0.56% 0.25% [44]
PACT 2 2 0.10% -0.70% [44] HAQ-Edge 5 5 0.91% 0.34% [236]
PACT 32 2 0.20% -0.20% [44] HAQ-Cloud 5 5 2.36% 1.31% [236]
DoReFa-Net 32 5 0.20% -0.50% [272] PACT 5 5 2.97% 1.67% [44]
QuantNet 3(±2) 32 0.30% 0.00% [253] HAQ-Cloud 4 4 4.80% 2.79% [236]
DoReFa-Net 32 4 0.30% -0.50% [272] HAQ-Edge 4 4 4.82% 2.92% [236]
WRPN 2 32 0.40% - [174] PACT 4 4 10.42% 6.53% [44]
DFP16 16 16 0.49% 0.59% [54]
PACT 3 2 0.50% -0.10% [44] ResNet-18 RangeBN 8 8 -0.60% - [15]
PACT 4 2 0.50% -0.10% [44] LBM 8 8 -0.60% - [268]
SYQ 1 8 0.50% 0.80% [66] QuantNet 5 32 -0.30% -0.10% [253]
QIL 3 3 0.50% - [127] QIL 5 5 -0.20% - [127]
FP8 8 8 0.50% - [237] QuantNet 3(±4) 32 -0.10% -0.10% [253]
BalancedQ 32 2 0.60% -2.00% [273] ShiftCNN 3 4 0.03% 0.12% [84]
ELNN 3(±2) 32 0.80% 0.60% [144] LQ-NETs 4 32 0.20% 0.50% [262]
SYQ 1 4 0.90% 0.80% [66] QIL 3 32 0.30% 0.30% [127]
QuantNet 2 32 0.90% 0.30% [253] LPBN 32 5 0.30% 0.40% [31]
FFN 2 32 1.00% 0.30% [238] QuantNet 3(±2) 32 0.40% 0.20% [253]
DoReFa-Net 32 2 1.00% 0.10% [272] PACT 32 4 0.40% 0.30% [44]
Unified INT8 8 8 1.00% - [275] SeerNet 4 1 0.42% 0.18% [32]
DeepShift-PS 6 32 1.19% 0.67% [62] ShiftCNN 2 4 0.54% 0.34% [84]
WEQ 4 4 1.20% 1.00% [192] PACT 5 5 0.60% 0.30% [44]
LQ-NETs 2 32 1.30% 0.80% [262] INQ 4 32 0.62% 0.10% [269]
SYQ 2 2 1.30% 1.00% [66] Unified INT8 8 8 0.63% - [275]
LQ-NETs 1 2 1.40% 1.40% [262] QIL 5 5 0.80% - [127]
BalancedQ 2 2 1.40% -1.00% [273] LQ-NETs 3(±4) 32 0.90% 0.80% [262]
WRPN-2x 8 8 1.50% - [174] QIL 3 3 1.00% - [127]
DoReFa-Net 1 4 1.50% - [272] DeepShift-Q 6 32 1.09% 0.47% [62]
DeepShift-Q 6 32 1.55% 0.81% [62] ELNN 3(±2) 32 1.10% 0.70% [144]
WRPN-2x 32 8 1.60% - [174] PACT 32 3 1.20% 0.70% [44]
WEQ 3 4 1.60% 1.10% [192] PACT 4 4 1.20% 0.60% [44]
WRPN-2x 8 4 1.70% - [174] QuantNet 2 32 1.20% 0.60% [253]
WRPN-2x 4 8 1.70% - [174] ELNN 3(±4) 32 1.30% 0.60% [144]
SYQ 1 2 1.70% 1.60% [66] DeepShift-PS 6 32 1.44% 0.67% [62]
ELNN 2 32 1.80% 1.80% [144] ABC-Net 5 32 1.46% 1.18% [155]
WRPN-2x 4 4 1.90% - [174] ELNN 3(±2) 32 1.60% 1.10% [144]
WRPN-2x 32 4 1.90% - [174] DoReFa-Net 32 5 1.70% 1.00% [272]
SYQ 2 8 1.90% 1.40% [66]
GoogLeNet Mixed-Precision 16 16 -0.10% - [172] DoReFa-Net 32 4 1.90% 1.10% [272]
DeepShift-PS 6 32 -0.09% -0.09% [62] LQ-NETs 3 3 2.00% 1.60% [262]
DFP16 16 16 -0.08% 0.00% [54] DoReFa-Net 5 5 2.00% 1.30% [272]
AngleEye 16 16 0.05% 0.45% [85] ELNN 2 32 2.10% 1.50% [144]
AngleEye 16 16 0.05% 0.45% [85] QIL 2 32 2.10% 1.30% [127]
ShiftCNN 3 4 0.05% 0.09% [84] DoReFa-Net 32 3 2.10% 1.40% [272]
DeepShift-Q 6 32 0.27% 0.29% [62] QIL 4 4 2.20% - [127]
LogQuant 32 6 0.36% 0.28% [30] LQ-NETs 2 32 2.20% 1.60% [262]
ShiftCNN 2 4 0.39% 0.29% [84] GroupNet-8 1 1 2.20% 1.40% [276]
TensorRT 8 8 0.45% 0.19% [173] PACT 3 3 2.30% 1.40% [44]
LogQuant 6 32 0.64% 0.67% [30] DoReFa-Net 4 4 2.30% 1.50% [272]
INQ 5 32 0.76% 0.25% [269] TTN 2 32 2.50% 1.80% [274]
ELNN 3(±4) 32 2.40% 1.40% [144] TTQ 2 32 2.70% 2.00% [277]
ELNN 3(±2) 32 2.80% 1.60% [144] AddNN 32 32 2.80% 1.50% [35]
LogQuant 6 6 3.43% 0.78% [30] ELNN 2 32 2.80% 1.50% [144]
QNN 4 4 5.10% 7.80% [113] LPBN 32 4 2.90% 1.70% [31]
QNN 6 6 5.20% 8.10% [113] PACT 32 2 2.90% 2.00% [44]
ELNN 2 32 5.60% 3.50% [144] DoReFa-Net 3 3 2.90% 2.00% [272]
BWN 1 32 5.80% 4.80% [200] QuantNet 1 32 3.10% 1.90% [253]
AngleEye 8 8 6.00% 3.20% [85] INQ 2 32 3.10% 1.90% [269]
TWN 2 32 7.50% 4.80% [146]
ELNN 1 32 8.40% 5.70% [144] ResNet-34 WRPN-2x 4 4 -0.93% - [174]
BWN 2 32 9.70% 6.50% [200] WRPN-2x 4 8 -0.89% - [174]


Model Deployment W A Top-1 Top-5 Ref. Model Deployment W A Top-1 Top-5 Ref.
QIL 4 4 0.00% - [127] SYQ 1 8 5.40% 3.40% [66]
QIL 5 5 0.00% - [127] DoReFa-Net 4 4 5.50% 3.30% [272]
WRPN-2x 4 2 0.01% - [174] DoReFa-Net 5 5 5.50% -0.20% [272]
WRPN-2x 2 4 0.09% - [174] FGQ 2 8 5.60% - [170]
WRPN-2x 2 2 0.27% - [174] ABC-Net 5 5 6.30% 3.50% [155]
SeerNet 4 1 0.35% 0.17% [32] FGQ-TWN 2 4 6.67% - [170]
Unified INT8 8 8 0.39% - [275] HWGQ 1 2 6.90% 4.60% [31]
LCCL 0.43% 0.17% [59] ResNet-100 IAO 8 8 1.40% - [120]
QIL 3 3 0.60% - [127] ResNet-101 TensorRT 8 8 -0.01% 0.05% [173]
WRPN-3x 1 1 0.90% - [174] FGQ-TWN 2 8 3.65% - [170]
WRPN-3x 1 1 1.21% - [174] FGQ-TWN 2 4 6.81% - [170]
GroupNet-8 1 1 1.40% 1.00% [276] ResNet-150 IAO 8 8 2.10% - [120]
dLAC 2 16 1.67% 0.89% [235] ResNet-152 TensorRT 8 8 0.08% 0.04% [173]
LQ-NETs 3 3 1.90% 1.20% [262] dLAC 2 16 1.20% 0.64% [235]
GroupNet**-5 1 1 2.70% 2.10% [276]
IR-Net 1 32 2.90% 1.80% [199] SqueezeNet AngleEye 16 16 0.00% 0.01% [85]
QIL 2 2 3.10% - [127] ShiftCNN 3 4 0.01% 0.01% [84]
WRPN-2x 1 1 3.40% - [174] ShiftCNN 2 4 1.01% 0.71% [84]
WRPN-2x 1 1 3.74% - [174] AngleEye 8 8 1.42% 1.05% [85]
LQ-NETs 2 2 4.00% 2.30% [262] AngleEye 6 6 28.13% 27.43% [85]
GroupNet-5 1 1 4.70% 3.40% [276] ShiftCNN 1 4 35.39% 35.09% [84]
ABC-Net 5 5 4.90% 3.10% [155] VGG-16 ELNN 3(±4) 32 -1.10% -1.00% [144]
HWGQ 1 32 5.10% 3.40% [31] ELNN 3(±2) 32 -0.60% -0.80% [144]
WAGEUBN 8 8 5.18% - [254] AngleEye 16 16 0.09% -0.05% [85]
ABC-Net 3 3 6.60% 3.90% [155] DFP16 16 16 0.11% 0.29% [54]
LQ-NETs 1 2 6.70% 4.40% [262] AngleEye 8 8 0.21% 0.08% [85]
LQ-NETs 4 4 6.70% 4.40% [262] SeerNet 4 1 0.28% 0.10% [32]
BCGD 1 4 7.60% 4.70% [256] DeepShift-Q 6 32 0.29% 0.11% [62]
HWGQ 1 2 9.00% 5.60% [31] FFN 2 32 0.30% -0.20% [238]
IR-Net 1 1 9.50% 6.20% [199] DeepShift-PS 6 32 0.47% 0.30% [62]
CI-BCNN (add) 1 1 11.07% 6.39% [240] DeepShift-Q 6 32 0.72% 0.29% [62]
Bi-Real 1 1 11.10% 7.40% [252] INQ 5 32 0.77% 0.08% [62]
WRPN-1x 1 1 12.80% - [174] TWN 2 32 1.10% 0.30% [146]
WRPN 1 1 13.05% - [174] ELNN 2 32 2.00% 0.90% [144]
CI-BCNN 1 1 13.59% 8.65% [240] TSQ 2 2 2.00% 0.70% [239]
DoReFa-Net 1 4 14.60% - [272] AngleEye 16 16 2.15% 1.49% [85]
DoReFa-Net 1 2 20.40% - [272] BWN 2 32 2.20% 1.20% [200]
ABC-Net 1 1 20.90% 14.80% [155] AngleEye 8 8 2.35% 1.76% [85]
BNN 1 1 29.10% 24.20% [272] ELNN 1 32 3.30% 1.80% [144]
ResNet-50 Mixed-Precision 16 16 -0.12% - [172] AngleEye 6 6 9.07% 6.58% [85]
DFP16 16 16 -0.07% -0.06% [54] AngleEye 6 6 22.38% 17.75% [85]
QuantNet 5 32 0.00% 0.00% [253] LogQuant 3 3 - 0.99% [30]
LQ-NETs 4 32 0.00% 0.10% [262] LogQuant 4 4 - 0.51% [30]
FGQ 32 32 0.00% - [170] LogQuant 6 6 - 0.83% [30]
TensorRT 8 8 0.13% 0.12% [173] LogQuant 32 3 - 0.82% [30]
PACT 5 5 0.20% -0.20% [44] LogQuant 32 4 - 0.36% [30]
QuantNet 3(±4) 32 0.20% 0.00% [253] LogQuant 32 6 - 0.31% [30]
Unified INT8 8 8 0.26% - [275] LogQuant 6 32 - 0.76% [30]
ShiftCNN 3 4 0.29% 0.15% [84] LDR 5 4 - 0.90% [175]
ShiftCNN 3 4 0.31% 0.16% [84] LogNN 5 4 - 1.38% [175]
PACT 4 4 0.40% -0.10% [44]
LPBN 32 5 0.40% 0.40% [81]
ShiftCNN 2 4 0.67% 0.41% [84]
DeepShift-Q 6 32 0.81% 0.21% [62]
DeepShift-PS 6 32 0.84% 0.31% [62]
PACT 5 32 0.90% 0.20% [44]
QuantNet 3(±2) 32 0.90% 0.40% [253]
PACT 4 32 1.00% 0.20% [44]
dLAC 2 16 1.20% - [235]
QuantNet 2 32 1.20% 0.60% [253]
AddNN 32 32 1.30% 1.20% [35]
LQ-NETs 4 4 1.30% 0.80% [262]
LQ-NETs 2 32 1.30% 0.90% [262]
INQ 5 32 1.32% 0.41% [269]
PACT 3 32 1.40% 0.50% [44]
IAO 8 8 1.50% - [120]
PACT 3 3 1.60% 0.50% [44]
HAQ 2MP 4MP 1.91% - [236]
HAQ MP MP 2.09% - [236]
LQ-NETs 3 3 2.20% 1.60% [262]
LPBN 32 4 2.20% 1.20% [81]
Deep Comp. 3 MP 2.29% - [92]
PACT 4 2 2.40% 1.20% [44]
ShiftCNN 2 4 2.49% 1.64% [84]
FFN 2 32 2.50% 1.30% [238]
UNIQ 4 8 2.60% - [18]
QuantNet 1 32 3.20% 1.70% [253]
SYQ 2 8 3.70% 2.10% [66]
FGQ-TWN 2 8 4.29% - [170]
PACT 2 2 4.70% 2.60% [44]
LQ-NETs 2 2 4.90% 2.90% [262]


References

jection for Non-Uniform Quantization of Neural Networks. arXiv


preprint arXiv:1804.10969 URL: http://arxiv.org/abs/1804.10969.
[1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, [19] Bengio, E., Bacon, P.L., Pineau, J., Precup, D., 2015. Conditional
C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Computation in Neural Networks for faster models. ArXiv preprint
Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, URL: http://arxiv.org/abs/1511.06297.
R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., [20] Bengio, Y., 2013. Estimating or Propagating Gradients Through
Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, Stochastic Neurons. ArXiv preprint URL: http://arxiv.org/abs/
B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, 1305.2982.
V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., [21] Bethge, J., Bartz, C., Yang, H., Chen, Y., Meinel, C., 2020. MeliusNet:
Yu, Y., Zheng, X., 2016. TensorFlow: Large-Scale Machine Learn- Can Binary Neural Networks Achieve MobileNet-level Accuracy?
ing on Heterogeneous Distributed Systems. arXiv preprint arXiv: ArXiv preprint URL: http://arxiv.org/abs/2001.05936.
1603.04467 URL: https://arxiv.org/abs/1603.04467. [22] Bethge, J., Yang, H., Bornstein, M., Meinel, C., 2019. Binary-
[2] Abdel-Hamid, O., Mohamed, A.r., Jiang, H., Deng, L., Penn, G., DenseNet: Developing an architecture for binary neural networks.
Yu, D., 2014. Convolutional Neural Networks for Speech Recogni- Proceedings - 2019 International Conference on Computer Vision
tion. IEEE/ACM Transactions on Audio, Speech, and Language Pro- Workshop, ICCVW 2019 , 1951–1960doi:10.1109/ICCVW.2019.00244.
cessing 22, 1533–1545. URL: http://ieeexplore.ieee.org/document/ [23] Bianco, S., Cadene, R., Celona, L., Napoletano, P., 2018. Benchmark
6857341/, doi:10.1109/TASLP.2014.2339736. analysis of representative deep neural network architectures. IEEE
[3] Abdelouahab, K., Pelcat, M., Serot, J., Berry, F., 2018. Accelerating Access 6, 64270–64277. doi:10.1109/ACCESS.2018.2877890.
CNN inference on FPGAs: A Survey. ArXiv preprint URL: http: [24] Blalock, D., Ortiz, J.J.G., Frankle, J., Guttag, J., 2020. What is
//arxiv.org/abs/1806.01683. the State of Neural Network Pruning? ArXiv preprint URL: http:
[4] Achronix Semiconductor Corporation, 2020. FPGAs Enable the Next //arxiv.org/abs/2003.03033.
Generation of Communication and Networking Solutions. White [25] Bolukbasi, T., Wang, J., Dekel, O., Saligrama, V., 2017. Adaptive
Paper WP021, 1–15. Neural Networks for Efficient Inference. Thirty-fourth International
[5] Albanie, 2020. convnet-burden. URL: https://github.com/albanie/ Conference on Machine Learning URL: https://arxiv.org/abs/1702.
convnet-burden. 07811http://arxiv.org/abs/1702.07811.
[6] Alemdar, H., Leroy, V., Prost-Boucle, A., Petrot, F., 2017. Ternary [26] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal,
neural networks for resource-efficient AI applications, in: 2017 Inter- P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S.,
national Joint Conference on Neural Networks (IJCNN), IEEE. pp. Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A.,
2547–2554. URL: https://ieeexplore.ieee.org/abstract/document/ Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
7966166/, doi:10.1109/IJCNN.2017.7966166. Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish,
[7] AMD, . Radeon Instinct™ MI25 Accelerator. URL: https://www. S., Radford, A., Sutskever, I., Amodei, D., 2020. Language Models
amd.com/en/products/professional-graphics/instinct-mi25. are Few-Shot Learners. ArXiv preprint URL: http://arxiv.org/abs/
[8] Arm, 2015. ARM Architecture Reference Man- 2005.14165.
ual ARMv8, for ARMv8-A architecture profile. [27] Buciluǎ, C., Caruana, R., Niculescu-Mizil, A., 2006. Model compres-
https://developer.arm.com/documentation/ddi0487/latest. URL: sion, in: Proceedings of the 12th ACM SIGKDD international con-
https://developer.arm.com/documentation/ddi0487/latest. ference on Knowledge discovery and data mining - KDD ’06, ACM
[9] Arm, 2020. Arm Cortex-M Processor Comparison Table. URL: Press, New York, New York, USA. p. 535. URL: https://dl.acm.
https://developer.arm.com/ip-products/processors/cortex-a. org/doi/abs/10.1145/1150402.1150464, doi:10.1145/1150402.1150464.
[10] Arm, Graphics, C., 2020. MALI-G76 High-Performance [28] BUG1989, 2019. BUG1989/caffe-int8-convert-tools: Generate a
GPU for Complex Graphics Features and Bene ts High Perfor- quantization parameter file for ncnn framework int8 inference. URL:
mance for Mixed Realities. URL: https://www.arm.com/products/ https://github.com/BUG1989/caffe-INT8-convert-tools.
silicon-ip-multimedia/gpu/mali-g76. [29] Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S., 2019. Once-for-All:
[11] ARM, Reddy, V.G., 2008. Neon technology introduction. ARM Cor- Train One Network and Specialize it for Efficient Deployment. ArXiv
poration , 1–34URL: http://caxapa.ru/thumbs/301908/AT_-_NEON_ preprint , 1–15URL: http://arxiv.org/abs/1908.09791.
for_Multimedia_Applications.pdf. [30] Cai, J., Takemoto, M., Nakajo, H., 2018. A Deep Look into Loga-
[12] Augasta, M.G., Kathirvalavakumar, T., 2013. Pruning algorithms of rithmic Quantization of Model Parameters in Neural Networks, in:
neural networks - A comparative study. Open Computer Science 3, Proceedings of the 10th International Conference on Advances in
105–115. doi:10.2478/s13537-013-0109-x. Information Technology - IAIT 2018, ACM Press, New York, New
[13] Baidu, 2019. PArallel Distributed Deep LEarning: Machine Learn- York, USA. pp. 1–8. URL: http://dl.acm.org/citation.cfm?doid=
ing Framework from Industrial Practice. URL: https://github.com/ 3291280.3291800, doi:10.1145/3291280.3291800.
PaddlePaddle/Paddle. [31] Cai, Z., He, X., Sun, J., Vasconcelos, N., 2017. Deep Learning with
[14] Balzer, W., Takahashi, M., Ohta, J., Kyuma, K., 1991. Weight Low Precision by Half-Wave Gaussian Quantization, in: 2017 IEEE
quantization in Boltzmann machines. Neural Networks 4, 405–409. Conference on Computer Vision and Pattern Recognition (CVPR),
doi:10.1016/0893-6080(91)90077-I. IEEE. pp. 5406–5414. URL: http://ieeexplore.ieee.org/document/
[15] Banner, R., Hubara, I., Hoffer, E., Soudry, D., 2018. 8100057/, doi:10.1109/CVPR.2017.574.
Scalable methods for 8-bit training of neural networks, [32] Cao, S., Ma, L., Xiao, W., Zhang, C., Liu, Y., Zhang, L.,
in: Advances in Neural Information Processing Systems Nie, L., Yang, Z., 2019. SeerNet : Predicting Convolu-
(NIPS), pp. 5145–5153. URL: http://papers.nips.cc/paper/ tional Neural Network Feature-Map Sparsity through Low-
7761-scalable-methods-for-8-bit-training-of-neural-networks. Bit Quantization. Proceedings of the IEEE/CVF Conference
[16] Banner, R., Nahshan, Y., Soudry, D., 2019. Post training 4-bit quanti- on Computer Vision and Pattern Recognition (CVPR) URL:
zation of convolutional networks for rapid-deployment, in: Advances http://openaccess.thecvf.com/content_CVPR_2019/papers/Cao_
in Neural Information Processing Systems (NIPS), pp. 7950–7958. SeerNet_Predicting_Convolutional_Neural_Network_Feature-Map_
[17] Baoyuan Liu, Min Wang, Foroosh, H., Tappen, M., Penksy, M., 2015. Sparsity_Through_Low-Bit_Quantization_CVPR_2019_paper.pdf.
Sparse Convolutional Neural Networks, in: 2015 IEEE Conference on [33] Carreira-Perpinan, M.A., Idelbayev, Y., 2018. "Learning-
Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 806– Compression" Algorithms for Neural Net Pruning, in: IEEE/CVF
814. URL: http://ieeexplore.ieee.org/document/7298681/, doi:10. Conference on Computer Vision and Pattern Recognition (CVPR),
1109/CVPR.2015.7298681. IEEE. pp. 8532–8541. URL: https://ieeexplore.ieee.org/document/
[18] Baskin, C., Schwartz, E., Zheltonozhskii, E., Liss, N., Giryes, R., 8578988/, doi:10.1109/CVPR.2018.00890.
Bronstein, A.M., Mendelson, A., 2018. UNIQ: Uniform Noise In-


[34] Chellapilla, K., Puri, S., Simard, P., 2006. High Performance Con- Artificial Intelligence Review 53, 5113–5155. URL: https://doi.
volutional Neural Networks for Document Processing, in: Tenth org/10.1007/s10462-020-09816-7, doi:10.1007/s10462-020-09816-7.
International Workshop on Frontiers in Handwriting Recognition. [49] Cornea, M., 2015. Intel ® AVX-512 Instructions and Their Use in
URL: https://hal.inria.fr/inria-00112631/, doi:10.1.1.137.482. the Implementation of Math Functions. Intel Corporation .
[35] Chen, H., Wang, Y., Xu, C., Shi, B., Xu, C., Tian, Q., Xu, C., 2020. [50] Cotofana, S., Vassiliadis, S., Logic, T., Addition, B., Addition, S.,
AdderNet: Do We Really Need Multiplications in Deep Learning?, in: 1997. Low Weight and Fan-In Neural Networks for Basic Arithmetic
[267] Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.H., Srivastava, M.,
Gupta, R., Zhang, Z., 2017. Accelerating Binarized Convolutional
Neural Networks with Software-Programmable FPGAs, in: Proceed-
ings of the 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays - FPGA ’17, ACM Press, New York, New
York, USA. pp. 15–24. URL: http://dl.acm.org/citation.cfm?doid=
3020078.3021741, doi:10.1145/3020078.3021741.
[268] Zhong, K., Zhao, T., Ning, X., Zeng, S., Guo, K., Wang, Y., Yang, H.,
2020. Towards Lower Bit Multiplication for Convolutional Neural
Network Training. ArXiv preprint URL: http://arxiv.org/abs/2006.
02804.
[269] Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y., 2017a.
Incremental Network Quantization: Towards Lossless
CNNs with Low-Precision Weights, in: International
Conference on Learning Representations(ICLR). URL:
https://github.com/Zhouaojun/Incremental-http://arxiv.org/
abs/1702.03044http://cn.arxiv.org/pdf/1702.03044.pdf.
[270] Zhou, H., Alvarez, J.M., Porikli, F., 2016a. Less Is More: To-
wards Compact CNNs, in: European Conference on Computer
Vision, pp. 662–677. URL: https://link.springer.com/chapter/
10.1007/978-3-319-46493-0_40http://link.springer.com/10.1007/
978-3-319-46493-0_40, doi:10.1007/978-3-319-46493-0{\_}40.
[271] Zhou, S., Kannan, R., Prasanna, V.K., 2018. Accelerating low rank
matrix completion on FPGA, in: 2017 International Conference on
Reconfigurable Computing and FPGAs, ReConFig 2017, IEEE. pp.
1–7. URL: http://ieeexplore.ieee.org/document/8279771/, doi:10.
1109/RECONFIG.2017.8279771.
[272] Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y., 2016b. DoReFa-
Net: Training Low Bitwidth Convolutional Neural Networks with

T Liang et al.: Preprint submitted to Elsevier Page 40 of 41


Survey on pruning and quantization

Tailin Liang received the B.E. degree in Computer Science and the B.B.A. degree from the University of Science and Technology Beijing in 2017. He is currently working toward a Ph.D. degree in Computer Science at the School of Computer and Communication Engineering, University of Science and Technology Beijing. His current research interests include deep learning domain-specific processors and co-designed optimization algorithms.

John Glossner received the Ph.D. degree in Electrical Engineering from TU Delft in 2001. He is the Director of the Computer Architecture, Heterogeneous Computing, and AI Lab at the University of Science and Technology Beijing. He is also the CEO of Optimum Semiconductor Technologies and President of both the Heterogeneous System Architecture Foundation and the Wireless Innovation Forum. His research interests include the design of heterogeneous computing systems, computer architecture, embedded systems, digital signal processors, software defined radios, artificial intelligence algorithms, and machine learning systems.

Lei Wang received the B.E. and Ph.D. degrees from the University of Science and Technology Beijing in 2006 and 2012, respectively. He then served as an assistant researcher at the Institute of Automation of the Chinese Academy of Sciences during 2012-2015. He was a joint Ph.D. student in Electronic Engineering at The University of Texas at Dallas during 2009-2011. Currently, he is an adjunct professor at the School of Computer and Communication Engineering, University of Science and Technology Beijing.

Shaobo Shi received the B.E. and Ph.D. degrees from the University of Science and Technology Beijing in 2008 and 2014, respectively. He then served as an assistant researcher at the Institute of Automation of the Chinese Academy of Sciences during 2014-2017. Currently, he is a deep learning domain-specific processor engineer at Huaxia General Processor Technology and also serves as an adjunct professor at the School of Computer and Communication Engineering, University of Science and Technology Beijing.

Xiaotong Zhang received the M.E. and Ph.D. degrees from the University of Science and Technology Beijing in 1997 and 2000, respectively, where he was a professor of Computer Science and Technology. His research interests include the quality of wireless channels and networks, wireless sensor networks, network management, cross-layer design and resource allocation of broadband and wireless networks, and the signal processing of communication and computer architecture.
