Pruning and Quantization For Deep Neural Network Acceleration: A Survey
Pruning and Quantization For Deep Neural Network Acceleration: A Survey
Survey
Tailin Lianga,b , John Glossnera,b,c , Lei Wanga , Shaobo Shia,b and Xiaotong Zhanga,∗
a Schoolof Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
b Hua Xia General Processor Technologies, Beijing 100080, China
c General Processor Technologies, Tarrytown, NY 10591, United States
neural network acceleration deployment and require significant computation resources and energy costs. These challenges can
neural network quantization be overcome through optimizations such as network compression. Network compression can often
neural network pruning be realized with little loss of accuracy. In some cases accuracy may even improve. This paper
low-bit mathematics provides a survey on two types of network compression: pruning and quantization. Pruning can be
categorized as static if it is performed offline or dynamic if it is performed at run-time. We compare
pruning techniques and describe criteria used to remove redundant computations. We discuss trade-offs
in element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise pruning.
Quantization reduces computations by reducing the precision of the datatype. Weights, biases, and
activations may be quantized typically to 8-bit integers although lower bit width implementations
are also discussed including binary neural networks. Both pruning and quantization can be used
independently or combined. We compare current techniques, analyze their strengths and weaknesses,
present compressed network accuracy results on a number of frameworks, and provide practical
guidance for compressing networks.
Figure 1: CNN Acceleration Approaches: Follow the sense from designing to implementing, CNN acceleration could fall into three
categories, structure design (or generation), further optimization, and specialized hardware.
related to general pruning techniques. include: General Processor Technologies (GPT) [179], ARM,
Parameter factorization is a technique that decomposes nVidia, and 60+ others [202] all have processors targeting
higher-rank tensors into lower-rank tensors simplifying mem- this space. ASICs may also target both training and inference
ory access and compressing model size. It works by breaking in datacenters. Tensor processing units (TPU) from Google
large layers into many smaller ones, thereby reducing the [125], Habana from Intel [169], Kunlun from Baidu [191],
number of computations. It can be applied to both convolu- Hanguang from Alibaba [124], and Intelligence Processing
tional and fully connected layers. This technique can also be Unit (IPU) from Graphcore [121].
applied with pruning and quantization. Programmable reconfigurable FPGAs have been used for
Network pruning [201, 24, 12, 250] involves removing neural network acceleration [86, 3, 234, 152]. FPGAs are
parameters that don’t impact network accuracy. Pruning can widely used by researchers due to long ASIC design cycles.
be performed in many ways and is described extensively in Neural network libraries are available from Xilinx [128] and
Section 3. Intel [69]. Specific neural network accelerators are also being
Network quantization [131, 87] involves replacing datatypes integrated into FPGA fabrics [248, 4, 203]. Because FPGAs
with reduced width datatypes. For example, replacing 32-bit operate at the gate level, they are often used in low-bit width
Floating Point (FP32) with 8-bit Integers (INT8). The val- and binary neural networks [178, 267, 197].
ues can often be encoded to preserve more information than Neural network specific optimizations are typically in-
simple conversion. Quantization is described extensively in corporated into custom ASIC hardware. Lookup tables can
Section 4. be used to accelerate trigonometric activation functions [46]
Hardware accelerators [151, 202] are designed primarily or directly generate results for low bit-width arithmetic [65],
for network acceleration. At a high level they encompass partial products can be stored in special registers and reused
entire processor platforms and often include hardware opti- [38], and memory access ordering with specialized address-
mized for neural networks. Processor platforms include spe- ing hardware can all reduce the number of cycles to compute
cialized Central Processing Unit (CPU) instructions, Graph- a neural network output [126]. Hardware accelerators are
ics Processing Units (GPUs), Application Specific Integrated not the primary focus of this paper. However, we do note
Circuits (ASICs), and Field Programmable Gate Arrays (FP- hardware implementations that incorporate specific accelera-
GAs). tion techniques. Further background information on efficient
CPUs have been optimized with specialized Artificial processing and hardware implementations of DNNs can be
Intelligence (AI) instructions usually within specialized Sin- found in [225].
gle Instruction Multiple Data (SIMD) units [49, 11]. While We summarize our main contributions as follows:
CPUs can be used for training, they have primarily been used
for inference in systems that do not have specialized inference • We provide a review of two network compression tech-
accelerators. niques: pruning and quantization. We discuss methods
GPUs have been used for both training and inference. of compression, mathematical formulations, and com-
nVidia has specialized tensor units incorporated into their pare current State-Of-The-Art (SOTA) compression
GPUs that are optimized for neural network acceleration methods.
[186]. AMD [7], ARM [10], and Imagination [117] also • We classify pruning techniques into static and dynamic
have GPUs with instructions for neural network acceleration. methods, depending if they are done offline or at run-
Specialized ASICs have also been designed for neural time, respectively.
network acceleration. They typically target inference at the
edge, in security cameras, or on mobile devices. Examples • We analyze and quantitatively compare quantization
RL Reinforcement Learning
RNN Recurrent Neural Network Kernels
1 1 1 1 0 1 1 0 2 1 1 2 1 1
2 2 1 1 1 0 0 1 2 1 2 0 1 0
SGD Stochastic Gradient Descent
2 0
STE Straight-Through Estimator 2 1
Input 1 2 0 0 2 1 1 2 0
1 2
ASIC Application Specific Integrated Circuit Features 1 1 3 0 3 2 0 1 3
1 1
AVX-512 Advance Vector Extension 512 (Activations) 0 2 2 1 1 0 3 3 2
1 2
CPU Central Processing Unit 1 1
CU Computing Unit 1 2 1 1 0 2 0 3 1 2 0 1 0 1 14 12
𝐖 𝐛−𝜇
𝐖′ = 𝛾 ⋅ √ , 𝐛′ = 𝛾 ⋅ √ +𝛽 (3)
𝜎2 + 𝜖 𝜎2 + 𝜖
2.6. Pooling
Figure 7: Popular CNN Models: Top-1 accuracy vs GFLOPs
Pooling was first published in the 1980s with neocogni-
and model size, adopted from [23]
tron [71]. The technique takes a group of values and reduces
them to a single value. The selection of the single replace-
ment value can be computed as an average of the values
(average pooling) or simply selecting the maximum value Execution time was not a factor. This incentivized neural
(max pooling). network designs with significant redundancy. As of 2020,
Pooling destroys spatial information as it is a form of models with more than 175 billion parameters have been
down-sampling. The window size defines the area of values to published [26].
be pooled. For image processing it is usually a square window Networks that execute in data centers can accommodate
with typical sizes being 2 × 2, 3 × 3 or 4 × 4. Small windows models with a large number of parameters. In resource con-
allow enough information to be propagated to successive strained environments such as edge and mobile deployments,
layers while reducing the total number of computations [224]. reduced parameter models have been designed. For exam-
Global pooling is a technique where, instead of reducing ple, GoogLeNet [226] achieves similar top-1 accuracy of
a neighborhood of values, an entire feature map is reduced to 69.78% as VGG-16 but with only 7 million parameters. Mo-
a single value [154]. Global Average Pooling (GAP) extracts bileNet [105] has 70% top-1 accuracy with only 4.2 million
information from multi-channel features and can be used with parameters and only 1.14 Giga FLoating-point OPerations
dynamic pruning [153, 42]. (GFLOPs). A more detailed network comparison can be
Capsule structures have been proposed as an alternative found in [5].
to pooling. Capsule networks replace the scalar neuron with
vectors. The vectors represent a specific entity with more 3. Pruning
detailed information, such as position and size of an object.
Capsule networks void loss of spatial information by captur- Network pruning is an important technique for both mem-
ing it in the vector representation. Rather than reducing a ory size and bandwidth reduction. In the early 1990s, pruning
neighborhood of values to a single value, capsule networks techniques were developed to reduce a trained large network
perform a dynamic routing algorithm to remove connections into a smaller network without requiring retraining [201].
[209]. This allowed neural networks to be deployed in constrained
environments such as embedded systems. Pruning removes
2.7. Parameters redundant parameters or neurons that do not significantly
Figure 7 show top-1 accuracy percent verses the number contribute to the accuracy of results. This condition may
of operations needed for a number of popular neural networks arise when the weight coefficients are zero, close to zero,
[23]. The number of parameters in each network is repre- or are replicated. Pruning consequently reduces the com-
sented by the size of the circle. A trend (not shown in the putational complexity. If pruned networks are retrained it
figure) is a yearly increase in parameter complexity. In 2012, provides the possibility of escaping a previous local minima
AlexNet [133] was published with 60 million parameters. In [43] and further improve accuracy.
2013, VGG [217] was introduced with 133 million parameters Research on network pruning can roughly be categorized
and achieved 71.1% top-1 accuracy. These were part of the as sensitivity calculation and penalty-term methods [201].
ImageNet large scale visual recognition challenge (ILSVRC) Significant recent research interest has continued showing
[207]. The competition’s metric was top-1 absolute accuracy. improvements for both network pruning categories or a fur-
Static Pruning
Network
Network Model Target Locating Training/Tuning
Pruning
Dynamic Pruning
Pruning Decision
Network Model Runtime Pruning
Strategy Componets
Figure 8: Pruning Categories: Static pruning is performed offline prior to inference while Dynamic pruning is performed at runtime.
ther combination of them. network which contains a series of layers (e.g., convolutional
Recently, new network pruning techniques have been cre- layer, pooling layer, etc.) with 𝑥 as input. 𝐿 represents the
ated. Modern pruning techniques may be classified by various pruned network with 𝑁𝑝 performance loss compared to the
aspects including: 1) structured and unstructured pruning de- unpruned network. Network performance is typically defined
pending if the pruned network is symmetric or not, 2) neuron as accuracy in classification. The pruning function, 𝑃 (⋅),
and connection pruning depending on the pruned element results in a different network configuration 𝑁𝑝 along with
type, or 3) static and dynamic pruning. Figure 8 shows the the pruned weights 𝐖𝑝 . The following sections are primarily
processing differences between static and dynamic pruning. concerned with the influence of 𝑃 (⋅) on 𝑁𝑝 . We also consider
Static pruning has all pruning steps performed offline prior how to obtain 𝐖𝑝 .
to inference while dynamic pruning is performed during run-
time. While there is overlap between the categories, in this 3.1. Static Pruning
paper we will use static pruning and dynamic pruning for Static pruning is a network optimization technique that
classification of network pruning techniques. removes neurons offline from the network after training and
Figure 9 shows a granularity of pruning opportunities. before inference. During inference, no additional pruning
The four rectangles on the right side correspond to the four of the network is performed. Static pruning commonly has
brown filters in the top of Figure 2. Pruning can occur three parts: 1) selection of parameters to prune, 2) the method
on an element-by-element, row-by-row, column-by-column, of pruning the neurons, and 3) optionally fine-tuning or re-
filter-by-filter, or layer-by-layer basis. Typically element-by- training [92]. Retraining may improve the performance of
element has the smallest sparsity impact, and results in a the pruned network to achieve comparable accuracy to the
unstructured model. Sparsity decreases from left-to-right in unpruned network but may require significant offline compu-
Figure 9. tation time and energy.
mathematically described by Equation 5. or at the activation map. The most intuitive magnitude-based
pruning methods is to prune all zero-valued weights or all
( )1
∑
𝑛 𝑝 weights within an absolute value threshold.
‖𝐱‖𝑝 = |𝑥 |𝑝 (5) LeCun as far back as 1990 proposed Optimal Brain Dam-
| 𝑖|
𝑖=1 age (OBD) to prune single non-essential weights [140]. By
using the second derivative (Hessian matrix) of the loss func-
Among the widely applied measurements, the 𝑙1 -norm
tion, this static pruning technique reduced network param-
is also known as the Manhattan norm and the 𝑙2 -norm is
eters by a quarter. For a simplified derivative computation,
also known as the Euclidean norm. The corresponding 𝑙1
OBD functions under three assumptions: 1) quadratic - the
and 𝑙2 regularization have the names LASSO (least absolute
cost function is near-quadratic, 2) extremal - the pruning is
shrinkage and selection operator) and Ridge, respectively
done after the network converged, and 3) diagonal - sums
[230]. The difference between the 𝑙2 -norm pruned tensor
up the error of individual weights by pruning the result of
and an unpruned tensor is called the 𝑙2 -distance. Sometimes
the error caused by their co-consequence. This research also
researchers also use the term 𝑙0 -norm defined as the total
suggested that the sparsity of DNNs could provide opportuni-
number of nonzero elements in a vector.
ties to accelerate network performance. Later Optimal Brain
Surgeon (OBS) [97] extended OBD with a similar second-
⎧𝑁 ( )2 ⎫ order method but removed the diagonal assumption in OBD.
⎪∑ ∑𝑝
⎪ OBS considers the Hessian matrix is usually non-diagonal
arg min ⎨ 𝑦𝑖 − 𝛼 − 𝛽𝑗 𝐱𝑖𝑗 ⎬
𝛼,𝛽
⎪ 𝑖=1 𝑗=1 ⎪ for most applications. OBS improved the neuron removal
⎩ ⎭ (6) precision with up to a 90% reduction in weights for XOR
∑𝑝 networks.
| |
subject to |𝛽𝑗 | ⩽ 𝑡 These early methods reduced the number of connections
| |
𝑗 based on the second derivative of the loss function. The
training procedure did not consider future pruning but still re-
Equation Equation 6 mathematically describes 𝑙2 LASSO
sulted in networks that were amenable to pruning. They also
regularization. Consider a sample consisting of 𝑁 cases, each
suggested that methods based on Hessian pruning would ex-
of which consists of 𝑝 covariates and a single outcome 𝑦𝑖 .
hibit higher accuracy than those pruned with only magnitude-
Let 𝑥𝑖 = (𝑥𝑖1 , ..., 𝑥𝑖𝑝 )𝑇 be the standardized covariate vec-
based algorithms [97]. More recent DNNs exhibit larger
tor for the 𝑖-th case (input feature in DNNs), so we have
∑ ∑ 2 weight values when compared to early DNNs. Early DNNs
𝑖 𝑥𝑖𝑗 ∕𝑁 = 0, 𝑖 𝑥𝑖𝑗 ∕𝑁 = 1. 𝛽 represents the coefficients were also much shallower with orders of magnitude less neu-
𝛽 = (𝛽1 , ..., 𝛽𝑝 ) (weights) and 𝑡 is a predefined tunning pa-
𝑇
rons. GPT-3 [26], for example, contains 175-billion param-
rameter that determines the sparsity. The LASSO estimate 𝛼 eters while VGG-16 [217] contains just 133-million param-
is 0 when the average of 𝑦𝑖 is 0 because for all 𝑡, the solution eters. Calculating the Hessian matrix during training for
∑
for 𝛼 is 𝛼 = 𝑦. If the constraint is 𝑝𝑗 𝛽𝑗2 ⩽ 𝑡 then the Equa- networks with the complexity of GPT-3 is not currently fea-
tion 6 becomes Ridge regression. Removing the constraint sible as it has the complexity of 𝑂(𝑊 2 ). Because of this
will results in the Ordinary Least Squares (OLS) solution. simpler magnitude-based algorithms have been developed
[177, 141].
{ } Filter-wise pruning [147] uses the 𝑙1 -norm to remove
1
𝑎𝑟𝑔 min ‖𝑦 − 𝐗𝛽‖22 + 𝜆 ‖𝛽‖1 (7) filters that do not affect the accuracy of the classification.
𝛽∈ℝ 𝑁
Pruning entire filters and their related feature maps resulted
Equation 6 can be simplified into the so-called Lagrangian in a reduced inference cost of 34% for VGG-16 and 38% for
form shown in Equation 7. The Lagrangian multiplier trans- ResNet-110 on the CIFAR-10 dataset with improved accuracy
lates the objective function 𝑓 (𝑥) and constraint 𝑔(𝑥) = 0 into 0.75% and 0.02%, respectively.
the format of (𝑥, 𝜆) = 𝑓 (𝑥) − 𝜆𝑔(𝑥), Where the ‖ ⋅ ‖𝑝 is the Most network pruning methods choose to measure weights
standard 𝑙𝑝 -norm, the 𝐗 is the covariate matrix that contains rather than activations when rating the effectiveness of prun-
𝑥𝑖𝑗 , and 𝜆 is the data dependent parameter related to 𝑡 from ing [88]. However, activations may also be an indicator to
Equation 6. prune corresponding weights. Average Percentage Of Zeros
Both magnitude-based pruning and penalty based pruning (APoZ) [106] was introduced to judge if one output activa-
may generate zero values or near-zero values for the weights. tion map is contributing to the result. Certain activation
In this section we discuss both methods and their impact. functions, particularly rectification such as Rectified Linear
Unit (ReLU), may result in a high percentage of zeros in
Magnitude-based pruning: It has been proposed and is activations and thus be amenable to pruning. Equation 8
widely accepted that trained weights with large values are
shows the definition of APoZ(𝑖)𝑐 of the 𝑐-th neuron in the 𝑖-th
more important than trained weights with smaller values
[143]. This observation is the key to magnitude-based meth- layer, where 𝐎(𝑖)𝑐 denotes the activation, 𝑁 is the number of
ods. Magnitude-based pruning methods seek to identify un- calibration (validation) images, and 𝑀 is the dimension of
needed weights or features to remove them from runtime eval-
uation. Unneeded values may be pruned either in the kernel
W (l) (4) Wn(l)l ,:,:,: (1)
𝑓 𝐎(𝑖) (𝑘) = 0 Wn(l)l ,:,:,: (1)
𝑐,𝑗
( )
(l)
(2)
…
𝑘=0 𝑗=0 W:,c
l ,:,:
(l)
(2)
APoZ(𝑖) (𝑖)
(8) W:,c
𝑐 = APoZ 𝐎𝑐 = l ,:,:
shape-wise (l)
(3)
W:,cl ,ml ,kl
𝑁 ×𝑀 (l)
W:,cl ,ml ,kl (3)
(l)
filter-wise Wnl ,:,:,: (1) depth-wise W (l) (4)
Similarly, inbound pruning [195], also an activation tech- (l)
W:,c (2)
W (l) (4)
l ,:,:
nique, considers channels that do not contribute to the result. Figure 10: Types
W:,c ,m ,k of Sparsity(3)Geometry, adopted from [241]
(l)
W (l) (4)
Figure 2 are determined to be less contributing, the corre-
sponding channel of the filter in the bottom of the figure will as a whole. Equation 9 gives the pruning constraint where 𝐗
be removed. After pruning this technique achieved about and 𝛽 in Equation 7 are replaced by the higher dimensional
1.5× compression. 𝐗𝐣 and 𝛽𝑗 for the 𝑗 groups.
Filter-wise pruning using a threshold from the sum of
filters’ absolute values can directly take advantage of the 1
Penalty-based pruning: In penalty-based pruning, the goal Figure 10 shows Group LASSO with group shapes used
is to modify an error function or add other constraints, known in Structured Sparsity Learning (SSL) [241]. Weights are
1
as bias terms, in the training process. A penalty value is used split into multiple groups. Unneeded groups of weights are
to update some weights to zero or near zero values. These removed using LASSO feature selection. Groups may be
values are then pruned. determined based on geometry, computational complexity,
Hanson [96] explored hyperbolic and exponential bias group sparsity, etc. SSL describes an example where group
terms for pruning in the late 80s. This method uses weight sparsity in row and column directions may be used to reduce
decay in backpropagation to determine if a neuron should be the execution time of GEMM. SSL has shown improved
pruned. Low-valued weights are replaced by zeros. Residual inference times on AlexNet with both CPUs and GPUs by
zero valued weights after training are then used to prune 5.1× and 3.1×, respectively.
unneeded neurons. Group-wise brain damage [136] also introduced the group
Feature selection [55] is a technique that selects a subset LASSO constraint but applied it to filters. This simulates
of relevant features that contribute to the result. It is also brain damage and introduces sparsity. It achieved 2× speedup
known as attribute selection or variable selection. Feature se- with 0.7% ILSVRC-2012 accuracy loss on the VGG Network.
lection helps algorithms avoiding over-fitting and accelerates Sparse Convolutional Neural Networks (SCNN) [17] take
both training and inference by removing features and/or con- advantage of two-stage tensor decomposition. By decompos-
nections that don’t contribute to the results. Feature selection ing the input feature map and convolutional kernels, the ten-
also aids model understanding by simplifying them to the sors are transformed into two tensor multiplications. Group
most important features. Pruning in DNNs can be considered LASSO is then applied. SCNN also proposed a hardware
to be a kind of feature selection [123]. friendly algorithm to further accelerate sparse matrix compu-
LASSO was previously introduced as a penalty term. tations. They achieved 2.47× to 6.88× speed-up on various
LASSO shrinks the least absolute valued feature’s corre- types of convolution.
sponding weights. This increases weight sparsity. This op- Network slimming [158] applies LASSO on the scaling
eration is also referred to as LASSO feature selection and factors of BN. BN normalizes the activation by statistical
has been shown to perform better than traditional procedures parameters which are obtained during the training phase. Net-
such as OLS by selecting the most significantly contributed work slimming has the effect of introducing forward invisible
variables instead of using all the variables. This lead to ap- additional parameters without additional overhead. Specifi-
proximately 60% more sparsity than OLS [181]. cally, by setting the BN scaler parameter to zero, channel-wise
Element-wise pruning may result in an unstructured net- pruning is enabled. They achieved 82.5% size reduction with
work organizations. This leads to sparse weight matrices that VGG and 30.4% computation compression without loss of
are not efficiently executed on instruction set processors. In accuracy on ILSVRC-2012.
addition they are usually hard to compress or accelerate with- Sparse structure selection [111] is a generalized network
out specialized hardware support [91]. Group LASSO [260] slimming method. It prunes by applying LASSO to sparse
mitigates these inefficiencies by using a structured pruning scaling factors in neurons, groups, or residual blocks. Using
method that removes entire groups of neurons while main- an improved gradient method, Accelerated Proximal Gradi-
taining structure in the network organization [17]. ent (APG), the proposed method shows better performance
Group LASSO is designed to ensure that all the variables without fine-tunning achieving 4× speed-up on VGG-16 with
sorted into one group could be either included or excluded 3.93% ILSVRC-2012 top-1 accuracy loss.
Dropout: While not specifically a technique to prune net- Labeled Faces in the Wild (LFW) dataset [110] in the filed
works, dropout does reduce the number of parameters [222]. of face recognition.
It was originally designed as a stochastic regularizer to avoid A method that iteratively removes redundant neurons for
over-fitting of data [103]. The technique randomly omits a FCLs without requiring special validation data is proposed
percentage of neurons typically up to 50%, This dropout op- in [221]. This approach measures the similarity of weight
eration breaks off part of the connections between neurons to groups after a normalization. It removes redundant weights
avoid co-adaptations. Dropout could also be regarded as an and merges the weights into a single value. This lead to a
operation that separately trains many sub-networks and takes 34.89% reduction of FCL weights on AlexNet with 2.24%
the average of them during the inference phase. Dropout in- top-1 accuracy loss on ILSVRC-2012.
creases training overhead but it does not affect the inference Comparing with the similarity based approach above, DI-
time. Versity NETworks (DIVNET) [167] considers the calculation
Sparse variational dropout [176] added a dropout hyper- redundancy based on the activations. DIVNET introduces
parameter called the dropout rate to reduce the weights of Determinantal Point Process (DPP) [166] as a pruning tool.
VGG-like networks by 68×. During training the dropout rate DPP sorts neurons into categories including dropped and
can be used to identify single weights to prune. This can also retained. Instead of forcing the removal of elements with
be applied with other compression approaches for further low contribution factors, they fuse the neurons by a process
reduction in weights. named re-weighting. Re-weighting works by minimizing the
impact of neuron removal. This minimizes pruning influence
Redundancies: The goal of norm-based pruning algorithms and mitigates network information loss. They found 3% loss
is to remove zeros. This implies that the distribution of values on CIFAR-10 dataset when compressing the network into
should wide enough to retain some values but contain enough half weight.
values close to zero such that a smaller network organization ThiNet [164] adopts statistics information from the next
is still accurate. This does not hold in some circumstances. layer to determine the importance of filters. It uses a greedy
For example, filters that have small norm deviations or a large search to prune the channel that has the smallest reconstruc-
minimum norm have small search spaces making it difficult tion cost in the next layer. ThiNet prunes layer-by-layer in-
to prune based on a threshold [100]. Even when parameter stead of globally to minimize large errors in classification
values are wide enough, in some networks smaller values accuracy. It also prunes less during each training epoch to
may still play an important role in producing results. One allow for coefficient stability. The pruning ratio is a prede-
example of this is when large valued parameters saturate [64]. fined hyper-parameter and the runtime complexity is directly
In these cases magnitude-based pruning of zero values may related to the pruning ratio. ThiNet compressed ResNet-50
decrease result accuracy. FLOPs to 44.17% with a top-1 accuracy reduction of 1.87%.
Similarly, penalty-based pruning may cause network ac- He [101] adopts LASSO regression instead of a greedy
curacy loss. In this case, the filters identified as unneeded algorithm to estimate the channels. Specifically, in one itera-
due to similar coefficient values in other filters may actually tion, the first step is to evaluate the most important channel
be required. Removing them may significantly decrease net- using the 𝑙1 -norm. The next step is to prune the correspond-
work accuracy [88]. Section 3.1.2 describes techniques to ing channel that has the smallest Mean Square Error (MSE).
undo pruning by tuning the weights to minimize network loss Compared to an unpruned network, this approach obtained
while this section describes redundancy based pruning. 2× acceleration of ResNet-50 on ILSVRC-2012 with about
Using BN parameters, feature map channel distances can 1.4% accuracy loss on top-5, and a 4× reduction in execution
be computed by layer [266]. Using a clustering approach time with top-5 accuracy loss of 1.0% for VGG-16. The au-
for distance, nearby features can be tuned. An advantage thors categorize their approach as dynamic inference-time
of clustering is that redundancy is not measured with an channel pruning. However it requires 5000 images for cal-
absolute distance but a relative value. With about 60 epochs ibration with 10 samples per image and more importantly
of training they were able to prune the network resulting results in a statically pruned network. Thus we have placed
in a 50% reduction in FLOPs (including non-convolutional it under static pruning.
operations) with a reduction in accuracy of only 1% for both
top-1 and top-5 on the ImageNet dataset. 3.1.2. Pruning combined with Tuning or Retraining
Filter pruning via geometric median (FPGM) [100] iden- Pruning removes network redundancies and has the bene-
tifies filters to prune by measuring the 𝑙2 -distance using the fit of reducing the number of computations without significant
geometric median. FPGM found 42% FLOPs reduction with impact on accuracy for some network architectures. However,
0.05% top-1 accuracy drop on ILSVRC-2012 with ResNet- as the estimation criterion is not always accurate, some im-
101. portant elements may be eliminated resulting in a decrease in
The reduce and reused (also described as outbound) accuracy. Because of the loss of accuracy, time-consuming
method [195] prunes entire filters by computing the statis- fine-tuning or re-training may be employed to increase accu-
tical variance of each filter’s output using a calibration set. racy [258].
Filters with low variance are pruned. The outbound method Deep compression [92], for example, describes a static
obtained 2.37× acceleration with 1.52% accuracy loss on method to prune connections that don’t contribute to classi-
fication accuracy. In addition to feature map pruning they ResNet classification accuracy with only 5% to 10% size of
also remove weights with small values. After pruning they original weights.
re-train the network to improve accuracy. This process is AutoPruner [163] integrated the pruning and fine-tuning
performed iteratively three times resulting in a 9× to 13× of a three-stage pipeline as an independent training-friendly
reduction in total parameters with no loss of accuracy. Most layer. The layer helped gradually prune during training even-
of the removed parameters were from FCLs. tually resulting in a less complex network. AutoPruner pruned
73.59% of compute operations on VGG-16 with 2.39% ILSVRC-
Recoverable Pruning: Pruned elements usually cannot be 2012 top-1 loss. ResNet-50 resulted in a 65.80% of compute
recovered. This may result in reduced network capability. operations with 3.10% loss of accuracy.
Recovering lost network capability requires significant re-
training. Deep compression required millions of iterations to Training from Scratch: Observation shows that network
retrain the network [92]. To avoid this shortcoming, many ap- training efficiency and accuracy is inversely proportional
proaches adopt recoverable pruning algorithms. The pruned to structure sparsity. The more dense the network, the less
elements may also be involved in the subsequent training training time [94, 147, 70]. This is one reason that current
process and adjust themselves to fit the pruned network. pruning techniques tend to follow a train-prune-tune pipeline
Guo [88] describes a recoverable pruning method using rather than training a pruned structure from scratch.
binary mask matrices to indicate whether a single weight However, the lottery ticket hypothesis [70] shows that it is
value is pruned or not. The 𝑙1 -norm pruned weights can be not of primary importance to preserve the original weights but
stochastically spliced back into the network. Using this ap- the initialization. Experiments show that dense, randomly-
proach AlexNet was able to be reduced by a factor of 17.7× initialized pruned sub-networks can be trained effectively
with no accuracy loss. Re-training iterations were signifi- and reach comparable accuracy to the original network with
cantly reduced to 14.58% of Deep compression [92]. How- the same number of training iterations. Furthermore, stan-
ever this type of pruning still results in an asymmetric network dard pruning techniques can uncover the aforementioned
complicating hardware implementation. sub-networks from a large oversized network - the Winning
Soft Filter Pruning (SFP) [99] further extended recov- Tickets. In contrast with current static pruning techniques,
erable pruning using a dimension of filter. SFP obtained the lottery ticket hypothesis after a period of time drops all
structured compression results with an additional benefit or well-trained weights and resets them to an initial random
reduced inference time. Furthermore, SFP can be used on state. This technique found that ResNet-18 could maintain
difficult to compress networks achieving a 29.8% speed-up comparable performance with a pruning ratio up to 88.2% on
on ResNet-50 with 1.54% ILSVRC-2012 top-1 accuracy loss. the CIFAR-10 dataset.
Comparing with Guo’s recoverable weight [88] technique,
SFP achieves inference speed-ups closer to theoretical re- Towards Better Accuracy: By reducing the number of net-
sults on general purpose hardware by taking advantage of the work parameters, pruning techniques can also help to reduce
structure of the filter. over-fitting. Dense-Sparse-Dense (DSD) training [93] helps
various network improve classification accuracy by 1.1% to
Increasing Sparsity: Another motivation to apply fine-tuning 4.3%. DSD uses a three stage pipeline: 1) dense training to
is to increase network sparsity. Sparse constraints [270] ap- identify important connections, 2) prune insignificant weights
plied low rank tensor constraints [157] and group sparsity and sparse training with a sparsity constraint to take reduce
[57] achieving a 70% reduction of neurons with a 0.57% drop the number of parameters, and 3) re-dense the structure to
of AlexNet in ILSVRC-2012 top-1 accuracy. recover the original symmetric structure, this also increase
the model capacity. The DSD approach has also shown im-
Adaptive Sparsity: No matter what kind of pruning criteria pressive performance on the other type of deep networks such
is applied, a layer-wise pruning ratio usually requires a human as Recurrent Neural Networks (RNNs) and Long Short Term
decision. Too high a ratio resulting in very high sparsity may Memory networks (LSTMs).
cause the network to diverge requiring heavy re-tuning.
Network slimming [158], previously discussed, addresses 3.2. Dynamic Pruning
this problem by automatically computing layer-wise sparsity. Except for recoverable techniques, static pruning perma-
This achieved a 20× model size compression, 5× computing nently destroys the original network structure which may lead
reduction, and less than 0.1% accuracy loss on the VGG to a decrease in model capability. Techniques have been re-
network. searched to recover lost network capabilities but once pruned
Pruning can also be performed using a min-max optimiza- and re-trained, the static pruning approach can’t recover de-
tion module [218] that maintains network accuracy during stroyed information. Additionally, observations shows that
tuning by keeping a pruning ratio. This technique compressed the importance of neuron binding is input-independent [73].
the VGG network by a factor of 17.5× and resulted in a theo- Dynamic pruning determines at runtime which layers,
retical execution time (FLOPs) of 15.56% of the unpruned channels, or neurons will not participate in further activity.
network. A similar approach was proposed with an estima- Dynamic pruning can overcome limitations of static prun-
tion of weights sets [33]. By avoiding the use of a greedy ing by taking advantage of changing input data potentially
search to keep the best pruning ratio, they achieved the same reducing computation, bandwidth, and power dissipation. Dy-
Decision Components
Network Info
Decision Data
Network Data
1.Additional connections or side networks?
Side
2.Layer-wise pruning or channel-wise?
Network 2 5
3.One-shot information input or layer-wise?
Pruning
Decision 4.How to calculate the score?
1 4 7
Additional Connections 5.Predefined thresholds or dynamical?
attached to Network A
6.Continue, skip or exit computing?
7.How to train the decision components?
3
6 ... ...
Input
Image(s)
6
Network A Network B Network C
Cascade Network
namic pruning typically doesn’t perform runtime fine-tuning ending the computation and outputing the predicting
or re-training. In Figure 11, we show an overview of dynamic results [68, 145, 148]. In this case the remaining layers
pruning systems. The most important consideration is the de- are considered to be pruned.
cision system that decides what to prune. The related issues 7. Training the decision component: a) attached con-
are: nections can be trained along with the original net-
1. The type of the decision components: a) additional work [145, 148, 73], b) side networks are typically
connections attached to the original network used dur- trained using reinforcement learning (RL) algorithms
ing the inference phase and/or the training phase, b) [19, 153, 189, 246].
characteristics of the connections that can be learned For instruction set processors, feature maps or the number
by standard backpropagation algorithms [73], and c) a of filters used to identify objects is a large portion of band-
side decision network which tends to perform well but width usage [225] - especially for depth-wise or point-wise
is often difficult to train [153]. convolutions where features consume a larger portion of the
2. The pruning level (shape): a) channel-wise [153, 73, bandwidth [47]. Dynamic tuning may also be applied to stat-
42], b) layer-wise [145], c) block-wise [246], or d) ically pruned networks potentially further reducing compute
network-wise [25]. The pruning level chosen influ- and bandwidth requirements.
ences hardware design. A drawback of dynamic pruning is that the criteria to
3. Input data: a) one-shot information feeding [246] feeds determine which elements to prune must be computed at run-
the entire input to the decision system, and b) layer- time. This adds overhead to the system requiring additional
wise information feeding [25, 68] where a window of compute, bandwidth, and power. A trade-off between dy-
data is iteratively fed to the decision system along with namic pruning overhead, reduced network computation, and
the forwarding. accuracy loss, should be considered. One method to miti-
4. Computing a decision score: 𝑙𝑝 -norm [73], or b) other gate power consumption inhibits computations from 0-valued
approaches [108]. parameters within a Processing Element (PE) [153].
5. Score comparison: a) human experience/experiment
3.2.1. Conditional Computing
results [145] or b) automatic threshold or dynamic
Conditional computing involves activating an optimal
mechanisms [108].
part of a network without activating the entire network. Non-
6. Stopping criteria: a) in the case of layer-wise and
activated neurons are considered to be pruned. They do
network-wise pruning, some pruning algorithms skip
not participate in the result thereby reducing the number of
the pruned layer/network [19, 246], b) some algorithms
computations required. Conditional computing applies to
dynamically choose the data path [189, 259], and c)
training and inference [20, 56]. RL. The MDP reward function in the state-action-reward
Conditional computing has a similarity with RL in that sequence is computation efficiency. Rather than removing
they both learn a pattern to achieve a reward. Bengio [19] layers, a side network of RNP predicts which feature maps are
split the network into several blocks and formulates the block not needed. They found 2.3× to 5.9× reduction in execution
chosen policies as an RL problem. This approach consists time with top-5 accuracy loss from 2.32% to 4.89% for VGG-
of only fully connected neural networks and achieved a 5.3× 16.
speed-up on CIFAR-10 dataset without loss of accuracy.
3.2.3. Differentiable Adaptive Networks
3.2.2. Reinforcement Learning Adaptive Networks Most of the aforementioned decision components are non-
Adaptive networks aim to accelerating network inference differential, thus computationally expensive RL is adopted
by conditionally determining early exits. A trade-off be- for training. A number of techniques have been developed to
tween network accuracy and computation can be applied reduce training complexity by using differentiable methods.
using thresholds. Adaptive networks have multiple interme- Dynamic channel pruning [73] proposes a method to dy-
diate classifiers to provide the ability of an early exit. A namically select which channel to skip or to process using
cascade network is a type of adaptive network. Cascade net- Feature Boosting and Suppression (FBS). FBS is a side net-
works are the combinations of serial networks which all have work that guides channel amplification and omission. FBS is
output layers rather than per-layer outputs. Cascade networks trained along with convolutional networks using SGD with
have a natural advantage of an early exit by not requiring LASSO constraints. The selecting indicator can be merged
all output layers to be computed. If the early accuracy of a into BN parameters. FBS achieved 5× acceleration on VGG-
cascade network is not sufficient, inference could potentially 16 with 0.59% ILSVRC-2012 top-5 accuracy loss, and 2×
be dispatched to a cloud device [145, 25]. A disadvantage of acceleration on ResNet-18 with 2.54% top-1, 1.46% top-5
adaptive networks is that they usually need hyper-parameters accuracy loss.
optimized manually (e.g., confidence score [145]). This intro- Another approach, Dynamic Channel Pruning (DCP)
duces automation challenges as well as classification accuracy [42] dynamically prunes channels using a channel thresh-
loss. They found 28.75% test error on CIFAR-10 when set- old weighting (T-Weighting) decision. Specifically, this mod-
ting the threshold to 0.5. A threshold of 0.99 lowered the ule prunes the channels whose score is lower than a given
error to 15.74% at a cost of 3x to inference time. threshold. The score is calculated by a T-sigmoid activation
A cascading network [189] is an adaptive network with function, which is mathematically described in Equation 10,
an RL trained Composer that can determine a reasonable where 𝜎(𝑥) = 1∕(1 + 𝑒−𝑥 ) is the sigmoid function. The input
computation graph for each input. An adaptive controller to the T-sigmoid activation function is down sampled by a
Policy Preferences is used to intelligently enhance the Com- FCL from the feature maps. The threshold is found using
poser allowing an adjustment of the network computation iterative training which can be a computationally expensive
graph from sub-graphs. The Composer performs much better process. DCP increased VGG-16 top-5 error by 4.77% on
in terms of accuracy than the baseline network with the same ILSVRC-2012 for 5× computation speed-up. By comparison,
number of computation-involved parameters on a modified RNP increased VGG-16 top-5 error by 4.89% [153].
dataset, namely Wide-MNIST. For example, when invoking {
1k parameters, the baseline achieves 72% accuracy while the 𝜎(𝑥), if 𝑥 > 𝑇
ℎ(𝑥) = (10)
Composer obtained 85%. 0, otherwise
BlockDrop [246] introduced a policy network that trained
using RL to make an image-specific determination whether The cascading neural network by Leroux [145] reduced
a residual network block should participate in the follow- the average inference time of overfeat network [211] by 40%
ing computation. While the other approaches compute an with a 2% ILSVRC-2012 top-1 accuracy loss. Their criteria
exit confidence score per layer, the policy network runs only for early exit is based on the confidence score generated by an
once when an image is loaded. It generates a boolean vec- output layer. The auxiliary layers were trained with general
tor that indicates which residual blocks are activate or in- backpropagation. The adjustable score threshold provides a
active. BlockDrop adds more flexibility to the early exit trade-off between accuracy and efficiency.
mechanism by allowing a decision to be made on any block Bolukbasi [25] reports a system that contains a com-
and not just early blocks in Spatially Adaptive Computation bination of other SOTA networks (e.g., AlexNet, ResNet,
Time (SACT) [68]. This is discussed further in Section 3.2.3. GoogLeNet, etc.). A policy adaptively chooses a point to
BlockDrop achieves an average speed-up of 20% on ResNet- exit early. This policy can be trained by minimizing its cost
101 for ILSVRC-2012 without accuracy loss. Experiments function. They format the system as a directed acyclic graph
using the CIFAR dataset showed better performance than with various pre-trained networks as basic components. They
other SOTA counterparts at that time [68, 82, 147]. evaluate this graph to determine leaf nodes for early exit.
Runtime Neural Pruning (RNP) [153] is a framework The cascade of acyclic graphs with a combination of various
that prunes neural networks dynamically. RNP formulates networks reduces computations while maintaining predic-
the feature selection problem as a Markov Decision Process tion accuracy. ILSVRC-2012 experiments show ResNet-50
(MDP) and then trains an RNN-based decision network by acceleration of 2.8× with 1% top-5 accuracy loss and 1.9×
speed-up with no accuracy loss.
Considering the similarity of RNNs and residual networks This implies the pruned architecture itself is crucial to suc-
[83], Spatially Adaptive Computation Time (SACT) [68] cess. By this observation, the pruning algorithms could be
explored an early stop mechanism of residual networks in seen as a type of NAS. Liu concluded that because the weight
the spatial domain. SACT can be applied to various tasks values can be re-trained, by themselves they are not effica-
including image classification, object detection, and image cious. However, the lottery ticket hypothesis [70] achieved
segmentation. SACT achieved about 20% acceleration with comparable accuracy only when the weight initialization
no accuracy loss for ResNet-101 on ILSVRC-2012. was exactly the same as the unpruned model. Glae [72]
To meet the computation constraints, Multi-Scale Dense resolved the discrepancy by showing that what really matters
Networks (MSDNets) [108] designed an adaptive network is the pruning form. Specifically, unstructured pruning can
using two techniques: 1) an anytime-prediction to generate only be fine-tuned to restore accuracy but structured pruning
prediction results at many nodes to facilitate the network’s can be trained from scratch. In addition, they explored the
early exit and 2) batch computational budget to enforce a performance of dropout and 𝑙0 regularization. The results
simpler exit criteria such as a computation limit. MSDNets showed that simple magnitude based pruning can perform
combine multi-scale feature maps [265] and dense connec- better. They developed a magnitude based pruning algorithm
tivity [109] to enable accurate early exit while maintaining and showed the pruned ResNet-50 obtained higher accuracy
higher accuracy. The classifiers are differentiable so that than SOTA at the same computational complexity.
MSDNets can be trained using stochastic gradient descent.
MSDNets achieve 2.2× speed-up at the same accuracy for
ResNet-50 on ILSVRC-2012 dataset. 4. Quantization
To address the training complexity of adaptive networks, Quantization is known as the process of approximating
Li [148] proposed two methods. The first method is gradient a continuous signal by a set of discrete symbols or integer
equilibrium (GE). This technique helps backbone networks values. Clustering and parameter sharing also fall within
converge by using multiple intermediate classifiers across this definition [92]. Partial quantization uses clustering al-
multiple different network layers. This improves the gradi- gorithms such as k-means to quantize weight states and then
ent imbalance issue found in MSDNets [108]. The second store the parameters in a compressed file. The weights can be
method is an Inline Subnetwork Collaboration (ISC) and a decompressed using either a lookup table or a linear transfor-
One-For-All knowledge distillation (OFA). Instead of inde- mation. This is typically performed during runtime inference.
pendently training different exits, ISC takes early predictions This scheme only reduces the storage cost of a model. This
into later predictors to enhance their input information. OFA is discussed in Section 4.2.4. In this section we focus on
supervises all the intermediate exits using a final classifier. At numerical low-bit quantization.
a same ILSVRC-2012 top-1 accuracy of 73.1%, their network Compressing CNNs by reducing precision values has
takes only one-third the computational budget of ResNet. been previously proposed. Converting floating-point parame-
Slimmable Neural Networks (SNN) [259] are a type of ters into low numerical precision datatypes for quantizing neu-
networks that can be executed at different widths. Also known ral networks was proposed as far back as the 1990s [67, 14].
as switchable networks, the network enables dynamically Renewed interest in quantization began in the 2010s when 8-
selecting network architectures (width) without much compu- bit weight values were shown to accelerate inference without
tation overhead. Switchable networks are designed to adap- a significant drop in accuracy [233].
tively and efficiently make trade-offs between accuracy and Historically most networks are trained using FP32 num-
on-device inference latency across different hardware plat- bers [225]. For many networks an FP32 representation has
forms. SNN found that the difference of feature mean and greater precision than needed. Converting FP32 parameters
variance may lead to training faults. SNN solves this issue to lower bit representations can significantly reduce band-
with a novel switchable BN technique and then trains a wide width, energy, and on-chip area.
enough network. Unlike cascade networks which primar- Figure 12 shows the evolution of quantization techniques.
ily benefit from specific blocks, SNN can be applied with Initially, only weights were quantized. By quantizing, cluster-
many more types of operations. As BN already has two pa- ing, and sharing, weight storage requirements can be reduced
rameters as mentioned in Section 2, the network switch that by nearly 4×. Han [92] combined these techniques to reduce
controls the network width comes with little additional cost. weight storage requirements from 27MB to 6.9MB. Post train-
SNN increased top-1 error by 1.4% on ILSVRC-2012 while ing quantization involves taking a trained model, quantizing
achieving about 2× speed-up. the weights, and then re-optimizing the model to generate a
quantized model with scales [16]. Quantization-aware train-
3.3. Comparisons ing involves fine-tuning a stable full precision model or re-
Pruning techniques are diverse and difficult to compare. training the quantized model. During this process real-valued
Shrinkbench [24] is a unified benchmark framework aiming weights are often down-scaled to integer values - typically
to provide pruning performance comparisons. 8-bits [120]. Saturated quantization can be used to generate
There exist ambiguities about the value of the pre-trained feature scales using a calibratation algorithm with a calibra-
weights. Liu [160] argues that the pruned model could be tion set. Quantized activations show similar distributions
trained from scratch using a random weight initialization. with previous real-valued data [173]. Kullback-Leibler di-
Figure 12: Quantization Evolution: The development of quantization techniques, from left to right. Purple rectangles indicated
quantized data while blue rectangles represent full precision 32-bit floating point format.
vergence (KL-divergence, also known as relative entropy or There are many methods to quantize a given network. Gener-
information divergence) calibrated quantization is typically ally, they are formulated as Equation 12 where 𝑠 is a scalar
applied and can accelerate the network without accuracy loss that can be calculated using various methods. 𝑔(⋅) is the
for many well known models [173]. Fine-tuning can also be clamp function applied to floating-point values 𝐗𝑟 perform-
applied with this approach. ing the quantization. 𝑧 is the zero-point to adjust the true
KL-divergence is a measure to show the relative entropy zero in some asymmetrical quantization approaches. 𝑓 (⋅) is
of probability distributions between two sets. Equation 11 the rounding function. This section introduces quantization
gives the equation for KL-divergence. 𝑃 and 𝑄 are defined using the mathematical framework of Equation 12.
as discrete probability distributions on the same probability
space. Specifically, 𝑃 is the original data (floating-point)
distribution that falls in several bins. 𝑄 is the quantized data 𝑐𝑙𝑎𝑚𝑝(𝑥, 𝛼, 𝛽) = 𝑚𝑎𝑥(𝑚𝑖𝑛(𝑥, 𝛽), 𝛼) (13)
histogram. Equation 13 defines a clamp function. The min-max
( ) method is given by Equation 14 where [𝑚, 𝑀] are the bounds
∑
𝑁
𝑃 (𝑥𝑖 )
𝐷KL (𝑃 ‖𝑄) = 𝑃 (𝑥𝑖 ) log (11) for the minimum and maximum values of the parameters, re-
𝑖=0
𝑄(𝑥𝑖 ) spectively. 𝑛 is the maximum representable number derived
from the bit-width (e.g., 256 = 28 in case of 8-bit), and 𝑧, 𝑠
Depending upon the processor and execution environ- are the same as in Equation 12. 𝑧 is typically non-zero in the
ment, quantized parameters can often accelerate neural net- min-max method [120].
work inference.
Quantization research can be categorized into two focus 𝑔(𝑥) = 𝑐𝑙𝑎𝑚𝑝(𝑥, 𝑚, 𝑀)
areas: 1) quantization aware training (QAT) and 2) post train- 𝑛−1 𝑚 × (1 − 𝑛)
ing quantization (PTQ). The difference depends on whether 𝑠= , 𝑧= (14)
𝑀 −𝑚 𝑀 −𝑚
training progress is is taken into account during training. Al- where 𝑚 = min{𝐗𝑖 }, 𝑀 = max{𝐗𝑖 }
ternatively, we could also categorize quantization by where
data is grouped for quantization: 1) layer-wise and 2) channel- The max-abs method uses a symmetry bound shown in
wise. Further, while evaluating parameter widths, we could Equation 15. The quantization scale 𝑠 is calculated from
further classify by length: N-bit quantization. the largest one 𝑅 among the data to be quantized. Since the
Reduced precision techniques do not always achieve the bound is symmetrical, the zero point 𝑧 will be zero. In such
expected speedup. For example, INT8 inference doesn’t a situation, the overhead of computing an offset-involved
achieve exactly 4× speedup over 32-bit floating point due convolution will be reduced but the dynamic range is reduced
to the additional operations of quantization and dequanti- since the valid range is narrower. This is especially noticeable
zation. For instance, Google’s TensorFlow-Lite [227] and for ReLU activated data where all of which values fall on the
nVidia’s Tensor RT [173] INT8 inference speedup is about positive axis.
2-3×. Batch size is the capability to process more than one
image in the forward pass. Using larger batch sizes, Tensor 𝑔(𝑥) = 𝑐𝑙𝑎𝑚𝑝(𝑥, −𝑀, 𝑀)
RT does achieve 3-4× acceleration with INT8 [173]. 𝑠=
𝑛−1
, 𝑧=0 (15)
Section 8 summarizes current quantization techniques 𝑅
used on the ILSVRC-2012 dataset along with their bit-widths where 𝑅 = max{𝑎𝑏𝑠{𝐗𝑖 }}
for weights and activation.
Quantization can be applied on input features 𝐅, weights
4.1. Quantization Algebra 𝐖, and biases 𝐛. Taking feature 𝐅 and weights 𝐖 as an
example (ignoring the biases) and using the min-max method
gives Equation 16. The subscripts 𝑟 and 𝑞 denote the real-
𝐗𝑞 = 𝑓 (𝑠 × 𝑔(𝐗𝑟 ) + 𝑧) (12) valued and quantized data, respectively. The 𝑚𝑎𝑥 suffix is
$s_b = s_w \times s_f, \quad \mathbf{b}_q = \mathbf{b}_r \times s_b$   (20)

Figure 13: Integer Arithmetic-only Inference: The convolution operation takes unsigned int8 weights and inputs, accumulates them to unsigned int32, and then performs a 32-bit addition with biases. The ReLU6 operation outputs 8-bit integers. Adopted from [120].

4.2.2. Logarithmic Quantization
Bit-shift operations are inexpensive to implement in hardware compared to multiplication operations. FPGA implementations [6] specifically benefit by converting floating-point multiplication into bit shifts. Network inference can be further optimized if weights are also constrained to be power-of-two with variable-length encoding. Logarithmic quantization takes advantage of this by being able to express a larger dynamic range compared to linear quantization.

Inspired by binarized networks [52], introduced in Section 4.2.3, Lin [156] forced the neuron output into a power-of-two value. This converts multiplications into bit-shift operations by quantizing the representations at each layer of the binarized network. Both training and inference time are thus reduced by eliminating multiplications.

Incremental Network Quantization (INQ) [269] replaces weights with power-of-two values. This reduces computation time by converting multiplies into shifts. INQ weight quantization is performed iteratively. In one iteration, pruning-inspired weight partitioning is performed using group-wise quantization. These weights are then fine-tuned using a pruning-like measurement [92, 88]. Group-wise retraining fine-tunes a subset of weights in full precision to preserve ensemble accuracy. The other weights are converted into power-of-two format. After multiple iterations most of the full precision weights are converted to power-of-two. The final networks have weights from 2 (ternary) to 5 bits, with values near zero set to zero. Results of group-wise iterative quantization show lower error rates than a random power-of-two strategy. Specifically, INQ obtained 71× compression with 0.52% top-1 accuracy loss on ILSVRC-2012 with AlexNet.

Logarithmic Neural Networks (LogNN) [175] quantize weights and features into a log-based representation. Logarithmic backpropagation during training is performed using shift operations. Bases other than 𝑙𝑜𝑔2 can be used. 𝑙𝑜𝑔√2 based arithmetic is described as a trade-off between dynamic range and representation precision. 𝑙𝑜𝑔2 showed 7× compression with 6.2% top-5 accuracy loss on AlexNet, while 𝑙𝑜𝑔√2 showed 1.7% top-5 accuracy loss.

Shift convolutional neural networks (ShiftCNN) [84] improve efficiency by quantizing and decomposing the real-valued weight matrix into an 𝑁 × 𝐵 ranged bit-shift representation, encoded with code-books 𝐂 as shown in Equation 21. idx_𝑖(𝑛) is the index for the 𝑖-th weight in the 𝑛-th code-book. Each coded weight 𝑤𝑖 can be indexed by an 𝑁𝐵-bit expression.

$w_i = \sum_{n=1}^{N} \mathbf{C}_n\!\left[\mathrm{idx}_i(n)\right], \quad \mathbf{C}_n = \left\{0, \pm 2^{-n+1}, \pm 2^{-n}, \pm 2^{-n-1}, \ldots, \pm 2^{-n-\lfloor M/2 \rfloor + 2}\right\}, \quad \text{where } M = 2^B - 1$   (21)

Note that the number of code-books 𝐶𝑛 can be greater than one. This means the encoded weight might be a combination of multiple shift operations. This property allows ShiftCNN to expand to a relatively large-scale quantization or to shrink to binarized or ternary weights. We discuss ternary weights in Section 4.2.3. ShiftCNN was deployed on an FPGA platform and achieved comparable accuracy on the ImageNet dataset with 75% power saving and up to 1090× clock cycle speed-up. ShiftCNN achieves this impressive result without requiring retraining. With 𝑁 = 2 and 𝐵 = 4 encoding, SqueezeNet [115] has only 1.01% top-1 accuracy loss. The loss for GoogLeNet, ResNet-18, and ResNet-50 is 0.39%, 0.54%, and 0.67%, respectively, while compressing the weights to 7/32 of the original size. This implies that the weights have significant redundancy.

Based on LogNN, Cai [30] proposed improvements by disabling activation quantization to reduce overhead during inference. This also reduced the clamp bound hyperparameter tuning during training. These changes resulted in many low-valued weights that are rounded to the nearest value during encoding. Because the quantization levels are powers of two (2^𝑛), weight sparsity increases as 𝑛 increases. In this work, 𝑛 is allowed to take real values (𝑛 ∈ ℝ) to quantize the weights, which makes weight quantization more complex; however, a code-book helps to reduce the complexity.

In 2019, Huawei proposed DeepShift, a method of saving computing power by shift convolution [62]. DeepShift removed all floating-point multiply operations and replaced them with bit reversal and bit shift. The quantized weight 𝑊𝑞 transformation is shown mathematically in Equation 22, where 𝑆 is a sign matrix, 𝑃 is a shift matrix, and ℤ is the set of integers.

$W_q = S \times 2^{P}, \quad \text{s.t. } P \in \mathbb{Z},\; S \in \{-1, 0, +1\}$   (22)

Results indicate that DeepShift networks cannot be easily trained from scratch. They also show that shift-format networks do not directly learn well on larger datasets such as ImageNet. Similar to INQ, they show that fine-tuning a pre-trained network can improve performance. For example, with the same configuration of 32-bit activations and 6-bit shift-format weights, the top-1 ILSVRC-2012 accuracy loss on ResNet-18 when trained from scratch and when tuned from a pre-trained model is 4.48% and 1.09%, respectively.
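The shift-based formats above (ShiftCNN's code-books, INQ, and DeepShift's Equation 22) all reduce multiplication to sign changes and bit shifts. The sketch below simply rounds each weight to the nearest signed power of two in log-space; it is only illustrative under that assumption — the published methods assign exponents via trained code-books or retraining rather than plain rounding, and the helper name and exponent range are ours.

import numpy as np

def quantize_power_of_two(w, p_min=-8, p_max=0):
    # Round each weight to a signed power of two: w_q = s * 2**p (cf. Equation 22).
    s = np.sign(w)                                   # sign matrix S in {-1, 0, +1}
    mag = np.abs(w)
    p = np.full(w.shape, p_min, dtype=np.int32)      # shift matrix P (exponents)
    nz = mag > 0
    p[nz] = np.clip(np.round(np.log2(mag[nz])), p_min, p_max).astype(np.int32)
    wq = s * (2.0 ** p)
    wq[~nz] = 0.0                                    # keep exact zeros as zero
    return wq, s.astype(np.int8), p

w = np.random.randn(4, 4) * 0.1
wq, S, P = quantize_power_of_two(w)
# At inference, multiplying by wq reduces to a sign flip and a bit shift by |P|.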
DeepShift proposes models with differential backpropagation for generating shift coefficients during the retraining process. DeepShift-Q [62] is trained with floating-point parameters in backpropagation, with values rounded to a suitable format during inference. DeepShift-PS directly adopts the shift 𝑃 and sign 𝑆 parameters as trainable parameters.

Since logarithmic encoding has a larger dynamic range, redundant networks particularly benefit. However, less redundant networks show significant accuracy loss. For example, VGG-16, which is a redundant network, shows 1.31% top-1 accuracy loss, while DenseNet-121 shows 4.02% loss.

4.2.3. Plus-minus Quantization
Plus-minus quantization was introduced in 1990 [208]. This technique reduces all weights to 1-bit representations. Similar to logarithmic quantization, expensive multiplications are removed. In this section, we provide an overview of significant binarized network results. Simons [216] and Qin [198] provide an in-depth review of BNNs.

Binarized neural networks (BNN) have only 1-bit weights and often 1-bit activations. 0 and 1 are encoded to represent -1 and +1, respectively. Convolutions can be separated into multiplies and additions. In binary arithmetic, single-bit operations can be performed using and, xnor, and bit-count. We follow the introduction from [273] to explain the bit-wise operations. Single-bit fixed point dot products are calculated as in Equation 23, where and is a bit-wise AND operation and bitcount counts the number of 1's in the bit string.

$\boldsymbol{x} \cdot \boldsymbol{y} = \mathrm{bitcount}(\mathrm{and}(\boldsymbol{x}, \boldsymbol{y})), \quad \text{s.t. } \forall i,\; x_i, y_i \in \{0, 1\}$   (23)

This can be extended into multi-bit computations as in Equation 24 [53]. 𝒙 and 𝒚 are M-bit and K-bit fixed point integers, subject to $\boldsymbol{x} = \sum_{m=0}^{M-1} c_m(\boldsymbol{x}) 2^m$ and $\boldsymbol{y} = \sum_{k=0}^{K-1} c_k(\boldsymbol{y}) 2^k$, where $(c_m(\boldsymbol{x}))_{m=0}^{M-1}$ and $(c_k(\boldsymbol{y}))_{k=0}^{K-1}$ are bit vectors.

$\boldsymbol{x} \cdot \boldsymbol{y} = \sum_{m=0}^{M-1} \sum_{k=0}^{K-1} 2^{m+k}\, \mathrm{bitcount}\!\left[\mathrm{and}\!\left(c_m(\boldsymbol{x}), c_k(\boldsymbol{y})\right)\right], \quad \text{s.t. } c_m(\boldsymbol{x})_i, c_k(\boldsymbol{y})_i \in \{0, 1\}\; \forall i, m, k$   (24)

By removing complicated floating-point multiplications, networks are dramatically simplified with simple accumulation hardware. Binarization not only reduces the network size by up to 32×, but also drastically reduces memory usage, resulting in significantly lower energy consumption [174, 112]. However, reducing 32-bit parameters into a single bit results in a significant loss of information, which decreases prediction accuracy. Most quantized binary networks significantly under-perform compared to 32-bit competitors.

There are two primary methods to reduce floating-point values into a single bit: 1) stochastic and 2) deterministic [52]. Stochastic methods consider global statistics or the value of input data to determine the probability of some parameter being -1 or +1. Deterministic binarization directly computes the bit value based on a threshold, usually 0, resulting in a sign function. Deterministic binarization is much simpler to implement in hardware.

Binary Connect (BC), proposed by Courbariaux [52], is an early stochastic approach to binarizing neural networks. They binarized the weights in both forward and backward propagation. Equation 25 shows the stochastic binarization of 𝑥𝑏 with a hard sigmoid probability 𝜎(𝑥). Both the activations and the gradients use 32-bit single precision floating point. The trained BC network shows 1.18% classification error on the small MNIST dataset but 8.27% classification error on the larger CIFAR-10 dataset.

$x_b = \begin{cases} +1, & \text{with probability } p = \sigma(x) \\ -1, & \text{with probability } 1 - p \end{cases}, \quad \text{where } \sigma(x) = \mathrm{clamp}\!\left(\frac{x+1}{2}, 0, 1\right)$   (25)

Courbariaux extended BC networks by binarizing the activations. He named them BinaryNets [53], which is recognized as the first BNN. They also report a customized binary matrix multiplication GPU kernel that accelerates the calculation by 7×. BNN is considered the first binarized neural network where both weights and activations are quantized to binary values [216]. Considering the hardware cost of stochastic binarization, they made a trade-off to apply deterministic binarization in most circumstances. BNN reported 0.86% error on MNIST, 2.53% error on SVHN, and 10.15% error on CIFAR-10. The ILSVRC-2012 accuracy results for binarized AlexNet and GoogleNet are 36.1% top-1 and 47.1%, respectively, while the FP32 original networks achieve 57% and 68%, respectively [112].

Rastegari [200] explored binary weight networks (BWN) on the ILSVRC dataset with AlexNet and achieved the same classification accuracy as the single precision version. The key is a scaling factor 𝛼 ∈ ℝ+ applied to an entire layer of binarized weights 𝐁. This results in weight values similar to those computed using FP32, 𝐖 ≈ 𝛼𝐁. They also applied weight binarization to ResNet-18 and GoogLeNet, resulting in 9.5% and 5.8% top-1 accuracy loss compared to the FP32 versions, respectively. They also extended binarization to activations, called XNOR-Net, and evaluated it on the large ILSVRC-2012 dataset. Compared to BNN, XNOR-Net also applied a scaling factor to the input feature and a rearrangement of the network structure (swapping the convolution, activation, and BN). Finally, XNOR-Net achieved 44.2% top-1 classification accuracy on ILSVRC-2012 with AlexNet, while accelerating execution time 58× on CPUs. The attached scaling factor extended the binarized value expression, which reduced the network distortion and led to better ImageNet accuracy.

DoReFa-Net [272] also adopts plus-minus arithmetic for quantized networks. DoReFa additionally quantizes gradients to low bit-widths within 8-bit expressions during the backward pass. The gradients are quantized stochastically in back propagation. For example, it takes 1 bit to represent weights layer-wise, 2-bit activations, and 6 bits for gradients. We describe training details in Section 4.2.5. They found 9.8% top-1 accuracy loss on AlexNet with ILSVRC-2012 using the 1-2-6 combination. The result for the 1-4-32 combination is 2.9%.

Li [146] and Leng [144] showed that for ternary weights (−1, 0, and +1), in Ternary Weight Networks (TWN), only a slight accuracy loss was realized. Compared to BNN, TWN has an additional value to reduce information loss while still
accuracy loss on ILSVRC-2012. Weight regularization can slightly improve the accuracy of quantized networks by penalizing weights with large magnitudes [215]. Experiments showed that 𝑙2 regularization improved 8-bit quantized MobileNet top-1 accuracy by 0.23% on ILSVRC-2012.

BN has proved to have many advantages, including addressing the internal covariate shift issue [119]. It can also be considered a type of quantization. However, quantization performed with BN may have numerical instabilities. The BN layer has nonlinear square and square root operations. Low bit representations may be problematic when using nonlinear operations. To solve this, 𝑙1-norm BN [245] has only linear operations in both forward and backward training. It provided 1.5× speedup at half the power on FPGA platforms and can be used with both training and inference.

4.2.5. Quantization-aware Training
Most quantization methods use a global (layer-wise) quantization to reduce the full precision model into a reduced bit model. This can result in non-negligible accuracy loss. A significant drawback of quantization is information loss caused by the irreversible precision-reducing transform. Accuracy loss is particularly visible in binary networks and shallow networks. Applying binary weights and activations to ResNet-34 or GoogLeNet resulted in 29.10% and 24.20% accuracy loss, respectively [53]. It has been shown that backward propagation fine-tunes (retrains) a quantized network and can recover losses in accuracy caused by the quantization process [171]. The retraining is even resilient to binarization information distortions. Thus training algorithms play a crucial role when using quantization. In this section, we introduce (re)training of quantized networks.

BNN Training: For a binarized network that has binary valued weights it is not effective to update the weights using gradient descent methods due to typically small derivatives. Early quantized networks were trained with a variation of Bayesian inference named Expectation Back Propagation (EBP) [220, 41]. This method assigns limited parameter precision (e.g., binarized) weights and activations. EBP infers networks with quantized weights by updating the posterior distributions over the weights. The posterior distributions are updated by differentiating the parameters of the backpropagation.

BinaryConnect [52] adopted the probabilistic idea of EBP but, instead of optimizing the weights' posterior distribution, BC preserved floating-point weights for updates and then quantized them into binary values. The real-valued weights are updated using the back-propagated error by simply ignoring the binarization in the update.

A binarized network has only 1-bit parameters - ±1 quantized by a sign function. Single-bit parameters are non-differentiable and therefore it is not possible to calculate the gradients needed for parameter updating [208]. SGD algorithms have been shown to need 6 to 8 bits to be effective [180]. To work around these limitations the Straight-Through Estimator (STE), previously introduced by Hinton [102], was applied for propagating gradients by using discretization [112]. Equation 28 shows the STE for sign binarization, where 𝑐 denotes the cost function, 𝑤𝑟 is the real-valued weight, and 𝑤𝑏 is the binarized weight produced by the sign function. STE bypasses the binarization function to directly calculate real-valued gradients. The floating-point weights are then updated using methods like SGD. To avoid real-valued weights approaching infinity, BNNs typically clamp floating-point weights to the desired range of ±1 [112].

$\text{Forward:}\; w_b = \mathrm{sign}(w_r); \qquad \text{Backward:}\; \frac{\partial c}{\partial w_r} = \frac{\partial c}{\partial w_b}\, \mathbf{1}_{|w_r| \le 1}$   (28)

Unlike the forward phase, where weights and activations are produced with deterministic quantization, in the gradient phase the low-bit gradients should be generated by stochastic quantization [89, 271]. DoReFa [272] first successfully trained a network with gradient bit-widths less than eight and achieved a comparable result with 𝑘-bit quantization arithmetic. This low bit-width gradient scheme could accelerate training in edge devices with little impact on network accuracy, but with minimal inference acceleration compared to BNNs. DoReFa quantizes the weights, features, and gradients into many levels, obtaining a larger dynamic range than BNNs. They trained AlexNet on ImageNet from scratch with 1-bit weights, 2-bit activations, and 6-bit gradients. They obtained 46.1% top-1 accuracy (9.8% loss compared with the full precision counterpart). Equation 29 shows the weight quantizing approach. 𝑤 is the weights (the same as in Equation 28), limit is a limit function applied to the weights keeping them in the range of [0, 1], and quantize𝑘 quantizes the weights into 𝑘 levels. Feature quantization is performed using the 𝑓𝛼𝑘 = quantize𝑘 function.

$f_w^k = 2\,\mathrm{quantize}_k\!\left(\mathrm{limit}(w_r)\right) - 1, \quad \text{where } \mathrm{quantize}_k(w_r) = \frac{1}{2^k - 1}\mathrm{round}\!\left(\left(2^k - 1\right) w_r\right), \;\; \mathrm{limit}(x) = \frac{\tanh(x)}{2\max(|\tanh(x)|)} + \frac{1}{2}$   (29)

In DoReFa, gradient quantization is shown in Equation 30, where d𝑟 = 𝜕𝑐∕𝜕𝑟 is the backpropagated gradient of the cost function 𝑐 with respect to output 𝑟.

$\tilde{f}_\gamma^k = 2\max_0(|\mathrm{d}r|)\left[\mathrm{quantize}_k\!\left(\frac{\mathrm{d}r}{2\max_0(|\mathrm{d}r|)} + \frac{1}{2}\right) - \frac{1}{2}\right]$   (30)

As in deep feed-forward networks, the exploding gradient problem can cause BNNs not to train. To address this issue, Hou [104] formulated the binarization effect on the network loss as an optimization problem, which was solved by a proximal Newton's algorithm with diagonal Hessian approximation that directly minimizes the loss with respect to the binary weights. This optimization found 0.09% improvement on the MNIST dataset compared with BNN.
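The STE and DoReFa rules in Equations 28-30 can be written down directly. Below is a minimal NumPy sketch of the k-bit weight quantizer and the STE gradient mask; the function names are ours, and the training loop as well as the feature and gradient quantizers are omitted.

import numpy as np

def quantize_k(x, k):
    # DoReFa k-bit quantizer from Equation 29: maps [0, 1] to 2^k - 1 uniform levels.
    n = 2 ** k - 1
    return np.round(n * x) / n

def limit(w):
    # limit() from Equation 29: squashes real-valued weights into [0, 1].
    t = np.tanh(w)
    return t / (2 * np.abs(t).max()) + 0.5

def quantize_weights(w, k):
    # f_w^k from Equation 29: k-bit weights in [-1, 1].
    return 2 * quantize_k(limit(w), k) - 1

def ste_backward(grad_wb, w_real):
    # Straight-Through Estimator (Equation 28): pass the gradient through the
    # quantization step, but zero it where |w_real| > 1.
    return grad_wb * (np.abs(w_real) <= 1)

w = np.random.randn(3, 3)
wq = quantize_weights(w, k=2)            # the forward pass uses the quantized weights
g = ste_backward(np.ones_like(w), w)     # the backward pass updates the real-valued weights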
weights and quantized weights are both kept. During training 𝛼 is gradually raised to 1 until a fully quantized network is realized.

[Figure: training dataflows for quantized and reduced-precision training — an STE/AB path in which float and binary weights are blended by 𝛼 and 1 − 𝛼 before the convolution, and an FP16 path with loss scaling, float2half conversion, FP16 weights/features, and FP16 activations. Diagram not reproduced.]
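The surviving text and figure labels above describe a scheme in which the full-precision and quantized weights are both kept and blended by a coefficient 𝛼 that is ramped to 1 over training. A minimal sketch of that blending step follows; the helper name is ours and the binary quantizer is only a placeholder.

import numpy as np

def blended_weights(w_float, quantize, alpha):
    # Effective weight is a convex blend of the float and quantized weights.
    return (1.0 - alpha) * w_float + alpha * quantize(w_float)

w = np.random.randn(8, 8).astype(np.float32)
for step in (0, 50, 100):
    alpha = min(1.0, step / 100.0)                 # ramp alpha from 0 to 1
    w_eff = blended_weights(w, np.sign, alpha)     # fully quantized when alpha = 1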
Table 2
Low Precision Libraries Using Quantization: QAT is quantization-aware training, PTQ is
post-training quantization, and offset indicates the zero point 𝑧 in Equation 12.
Name Institution Core Lib Precision Method Platform Open-sourced
ARM CMSIS NN [129] Arm CMSIS 8-bit deploy only Arm Cortex-M Processor No
MACE [247] XiaoMi - 8-bit QAT and PTQ Mobile - CPU, Hexagon Chips, MTK APU Yes
MKL-DNN [204] Intel - 8-bit PTQ, mixed offset, and QAT Intel AVX Core Yes
NCNN [229] Tencent - 8-bit PTQ w/o offset Mobile Platform Yes
Paddle [13] Baidu - 8-bit QAT and PTQ w/o offset Mobile Platform Yes
QNNPACK [61] Facebook - 8-bit PTQ w/ offset Mobile Platform Yes
Ristretto [90] LEPS gemm 3 methods QAT Desktop Platform Yes
SNPE [228] Qualcomm - 16/8-bit PTQ w/ offset, max-min Snapdragon CPU, GPU, DSP No
Tensor-RT [173] nVidia - 8-bit PTQ w/o offset nVidia GPU Yes
TF-Lite [1] Google gemmlowp 8-bit PTQ w/ offset Mobile Platform Yes
PCIe with platform atomics can share the same address space [74]. Floating-point arithmetic units consume more energy and take longer to compute compared to integer arithmetic units. Consequently, low-bitwidth architectures are designed to accelerate computation [179]. Specialized algorithms and efficient hardware can accelerate neural network processing during both training and inference [202].

4.3.2. Efficient Kernels
Typically, low precision inference is only executed on convolutional layers. Intermediate values passed between layers use 32-bit floating point. This makes many of the frameworks amenable to modification.

Table 2 gives a list of major low precision acceleration frameworks and libraries. Most of them use INT8 precision. We will next describe some popular and open-source libraries in more detail.

Tensor RT [232, 242] is an nVidia-developed C++ library that facilitates high-performance inference on NVIDIA GPUs. It is a low precision inference library that eliminates the bias term in convolutional layers. It requires a calibration set to adjust the quantization thresholds for each layer or channel. Afterwards the quantized parameters are represented by a 32-bit floating-point scalar and INT8 weights.

Tensor RT takes a pre-trained floating-point model and generates a reusable optimized 8-bit integer or 16-bit half float model. The optimizer performs network profiling, layer fusion, memory management, and operation concurrency. Equation 31 shows the convolution-dequantization dataflow in Tensor RT for 8-bit integers. The intermediate result of the convolution of INT8 input feature 𝐅𝑖8 and weights 𝐖𝑖8 is accumulated into an INT32 tensor 𝐎𝑖32. It is dequantized by dividing by the feature and weight scales 𝑠𝑓, 𝑠𝑤.

$\mathbf{O}_{i32} = \mathbf{F}_{i8} * \mathbf{W}_{i8}, \quad \mathbf{O}_{f32} = \frac{\mathbf{O}_{i32}}{s_f \times s_w}$   (31)

Tensor RT applies a variant of max-abs quantization that avoids the storage and computation of the zero point term 𝑧 in Equation 15, finding a proper threshold rather than using the maximum absolute value of the floating-point tensor. KL-divergence is introduced to make a trade-off between numerical dynamic range and precision of the INT8 representation [173]. KL calibration can significantly help to avoid accuracy loss.

The method traverses a predefined possible range of scales and calculates the KL-divergence for all the points. It then selects the scale which minimizes the KL-divergence. KL-divergence is widely used in many post-training acceleration frameworks. nVidia found a model calibrated with 125 images showed only 0.36% top-1 accuracy loss using GoogLeNet on the ImageNet dataset.

Intel MKL-DNN [204] is an optimized computing library for Intel processors with the Intel AVX-512, AVX-2, and SSE4.2 Instruction Set Architectures (ISA). The library uses FP32 for training and inference. Inference can also be performed using 8 bits in convolutional layers, ReLU activations, and pooling layers. It also uses Winograd convolutions. MKL-DNN uses the max-abs quantization shown in Equation 15, where the features adopt unsigned 8-bit integers (𝑛𝑓 = 256) and the weights signed 8-bit integers (𝑛𝑤 = 128). The rounding function 𝑓(⋅) in Equation 12 uses nearest integer rounding. Equation 32 shows the quantization applied to a given tensor or each channel in a tensor. The maxima of the weights 𝑅𝑤 and features 𝑅𝑓 are calculated from the maximum absolute value of the tensors 𝕋𝑓 and 𝕋𝑤. The feature scale 𝑠𝑓 and weight scale 𝑠𝑤 are generated using 𝑅𝑓 and 𝑅𝑤. Then quantized 8-bit signed integer weights 𝐖𝑠8, 8-bit unsigned integer features 𝐅𝑢8, and 32-bit signed integer biases 𝐁𝑠32 are generated using the scales and a nearest rounding function ‖⋅‖.

$R_{\{f,w\}} = \max(\mathrm{abs}(\mathbb{T}_{\{f,w\}})), \quad s_f = \frac{255}{R_f}, \;\; s_w = \frac{127}{R_w}$
$\mathbf{W}_{s8} = \lVert s_w \times \mathbf{W}_{f32} \rVert \in [-127, 127], \quad \mathbf{F}_{u8} = \lVert s_f \times \mathbf{F}_{f32} \rVert \in [0, 255], \quad \mathbf{B}_{s32} = \lVert s_f \times s_w \times \mathbf{B}_{f32} \rVert \in [-2^{31}, 2^{31} - 1]$   (32)
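The max-abs dataflow of Equation 32 can be illustrated with a small NumPy sketch, using a matrix product in place of a convolution and omitting the bias; the scale constants 255 and 127 follow the unsigned-feature/signed-weight convention above, and the variable names are ours.

import numpy as np

F = np.abs(np.random.randn(16, 32)).astype(np.float32)   # non-negative, ReLU-like features
W = np.random.randn(32, 8).astype(np.float32)

s_f = 255.0 / np.abs(F).max()       # unsigned 8-bit feature scale (Equation 32)
s_w = 127.0 / np.abs(W).max()       # signed 8-bit weight scale

F_u8 = np.round(s_f * F).astype(np.int32)   # values fit in uint8; int32 avoids overflow in the product
W_s8 = np.round(s_w * W).astype(np.int32)   # values fit in int8

O_s32 = F_u8 @ W_s8                 # integer multiply-accumulate
O_f32 = O_s32 / (s_f * s_w)         # dequantize, as in Equation 33 (bias omitted)

print(np.abs(O_f32 - F @ W).max())  # small residual: rounding is the only error source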
An affine transformation using 8-bit multipliers and 32-bit accumulators results in Equation 33, with the same scale factors as defined in Equation 32 and ∗ denoting convolution.

$\mathbf{O}_{s32} = \mathbf{W}_{s8} * \mathbf{F}_{u8} + \mathbf{b}_{s32} \approx s_f s_w \left(\mathbf{W}_{f32} * \mathbf{F}_{f32} + \mathbf{b}_{f32}\right) = s_f \times s_w \times \mathbf{O}_{f32}$   (33)

It is an approximation since rounding is ignored. Equation 34 is the affine transformation in FP32 format; 𝐷 is the dequantization factor.

while performing inference. This is typically performed with batch calibration of input data. MACE also supports processor implementations optimized for ARM NEON and Qualcomm's Hexagon digital signal processor. OpenCL acceleration is also supported. Winograd convolutions can be applied for further acceleration as discussed in Section 4.2.2.

Quantized Neural Network PACKage (QNNPACK) [61] is a Facebook-produced open-source library optimized for edge computing, especially for mobile low precision neural network inference. It has the same method of quantization as TF-Lite, including the use of a zero point. The library has been integrated into PyTorch [193] to provide users a high-level interface. In addition to Winograd and FFT convolution operations, the library has a gemm optimized for cache indexing and feature packing. QNNPACK has a fully compiled solution for many mobile devices and has been deployed on millions of devices with Facebook applications.

Panel Dot product (PDOT) is a key feature of QNNPACK's highly efficient gemm library. It assumes computing efficiency is limited by memory, cache, and bandwidth rather than by Multiply and Accumulate (MAC) performance. PDOT computes multiple dot products in parallel as shown in Figure 16. Rather than loading just two operands per MAC operation, PDOT loads multiple columns and rows. This improves convolution performance, giving about a 1.41×-2.23× speedup for MobileNet on mobile devices [61].

produce the quantized output 𝐎𝑞. 𝐅, 𝐖, 𝐳 are the same as in Equation 35.

$\mathbf{O}_q = (\mathbf{F}_q + \mathbf{z}_f \times \mathbf{P}) * (\mathbf{W}_q + \mathbf{z}_w \times \mathbf{Q}) = \mathbf{F}_q * \mathbf{W}_q + \mathbf{z}_f \times \mathbf{P} \times \mathbf{W}_q + \mathbf{z}_w \times \mathbf{Q} \times \mathbf{F}_q + \mathbf{z}_f \times \mathbf{z}_w \times \mathbf{P} \times \mathbf{Q}$   (36)

the quantized or the quantize-dequantized weights.

$\mathbf{O}_{f32} = \left(\frac{\mathbf{F}_q}{n-1} \times \mathbf{F}_{max}\right) * \left(\frac{\mathbf{W}_q}{n-1} \times \mathbf{W}_{max}\right)$   (39)

Paddle uses max-abs in three ways to quantize parameters: 1) the average of the max absolute value in a calculation window, 2) the max absolute value during a calculation window, and 3) a sliding average of the max absolute value of the window. The third method is described in Equation 40, where 𝑉 is the max absolute value in the current batch, 𝑉𝑡 is the average value of the sliding window, and 𝑘 is a coefficient chosen by default as 0.9.

$V_t = (1 - k) \times V + k \times V_{t-1}$   (40)

The Paddle framework uses a specialized toolset, PaddleSlim, which supports Quantization, Pruning, Network Architecture Search, and Knowledge Distilling. They found an 86.47% size reduction of ResNet-50, with 1.71% ILSVRC-2012 top-1 accuracy loss.

4.3.3. Hardware Platforms
Figure 17 shows AI chips, cards, and systems plotted by peak operations versus power on a log scale, originally published in [202]. Three normalizing lines are shown at 100 GOPS/Watt, 1 TOP/Watt, and 10 TOPs/Watt. Hardware platforms are classified along several dimensions including: 1) training or inference, 2) chip, card, or system form factors, 3) datacenter or mobile, and 4) numerical precision. We focus on low precision general and specialized hardware in this section.
Table 3
Low Precision Libraries versus Accuracy for Common Networks in Multiple Frameworks.
Accuracy Float Accuracy Quant Accuracy Diff
Name Framework Method Top-1 Top-5 Top-1 Top-5 Top-1 Top-5
AlexNet TensorRT [173] PTQ, w/o offset 57.08% 80.06% 57.05% 80.06% -0.03% 0.00%
Ristretto [90] Dynamic FP 56.90% 80.09% 56.14% 79.50% -0.76% -0.59%
Ristretto [90] Minifloat 56.90% 80.09% 52.26% 78.23% -4.64% -1.86%
Ristretto [90] Pow-of-two 56.90% 80.09% 53.57% 78.25% -3.33% -1.84%
GoogleNet NCNN [28] PTQ, w/o offset 68.50% 88.84% 68.62% 88.68% 0.12% -0.16%
TensorRT [173] PTQ, w/o offset 68.57% 88.83% 68.12% 88.64% -0.45% -0.19%
Ristretto [90] Dynamic FP 68.93% 89.16% 68.37% 88.63% -0.56% -0.53%
Ristretto [90] Minifloat 68.93% 89.16% 64.02% 87.69% -4.91% -1.47%
Ristretto [90] Pow-of-two 68.93% 89.16% 57.63% 81.38% -11.30% -7.78%
Inception v3 TF-Lite [77] PTQ 78.00% 93.80% 77.20% - -0.80% -
TF-Lite [77] QAT 78.00% 93.80% 77.50% 93.70% -0.50% -0.10%
MobileNet v1 NCNN [28] PTQ, w/o offset 67.26% 87.92% 66.74% 87.43% -0.52% -0.49%
Paddle [13] QAT+Pruning 70.91% - 69.20% - -1.71% -
TF-Lite [77] PTQ 70.90% - 65.70% - -5.20% -
TF-Lite [77] QAT 70.90% - 70.00% - -0.90% -
MobileNet v2 QNNPACK [61] PTQ, w/ offset 71.90% - 72.14% - 0.24% -
TF-Lite [77] PTQ 71.90% - 63.70% - -8.20% -
TF-Lite [77] QAT 71.90% - 70.90% - -1.00% -
ResNet-101 TensorRT [173] PTQ, w/o offset 74.39% 91.78% 74.40% 91.73% 0.01% -0.05%
TF-Lite [77] PTQ 77.00% - 76.80% - -0.20% -
ResNet-152 TensorRT [173] PTQ, w/o offset 74.78% 91.82% 74.70% 91.78% -0.08% -0.04%
ResNet-18 NCNN [28] PTQ, w/o offset 65.49% 86.56% 65.30% 86.52% -0.19% -0.04%
ResNet-50 NCNN [28] PTQ, w/o offset 71.80% 89.90% 71.76% 90.06% -0.04% 0.16%
TensorRT [173] PTQ, w/o offset 73.23% 91.18% 73.10% 91.06% -0.13% -0.12%
SqueezeNet NCNN [28] PTQ, w/o offset 57.78% 79.88% 57.82% 79.84% 0.04% -0.04%
Ristretto [90] Dynamic FP 57.68% 80.37% 57.21% 79.99% -0.47% -0.38%
Ristretto [90] Minifloat 57.68% 80.37% 54.80% 78.28% -2.88% -2.09%
Ristretto [90] Pow-of-two 57.68% 80.37% 41.60% 67.37% -16.08% -13.00%
VGG-19 TensorRT [173] PTQ, w/o offset 68.41% 88.78% 68.38% 88.70% -0.03% -0.08%
FPGA from 2.5 to 5 frames per second with 1.3% accuracy loss.

TNN [6] is deployed on an FPGA with specialized computation units optimized for ternary value multiplication. A specific FPGA structure (dimensions) is determined during synthesis to improve hardware efficiency. On the Sakura-X FPGA board they achieved 255k MNIST image classifications per second with an accuracy of 98.14%. A scalable design implemented on a Xilinx Virtex-7 VC709 board dramatically reduced hardware resources and power consumption, but at a significantly reduced throughput of 27k CIFAR-10 images per second [197]. Power consumption for CIFAR-10 was 6.8 Watts.

Reducing hardware costs is a key objective of logarithmic hardware. Xu [249] adopted √2-based logarithmic quantization with 5 bits of resolution. This showed 50.8% top-1 accuracy and dissipated a quarter of the power while using half the chip area. Half precision inference has a top-1 accuracy of 53.8%.

General Hardware: In addition to specialized hardware, INT8 quantization has been widely adopted in many general purpose processor architectures. In this section we provide a high-level overview. A detailed survey on hardware efficiency for processing DNNs can be found in [202].

CNN acceleration on ARM CPUs was originally implemented through the ARM advanced SIMD extensions known as NEON. The ARM 8.2 ISA extension added NEON support for 8-bit integer matrix operations [8]. These were implemented in the CPU IP cores Cortex-A75 and A55 [9] as well as the Mali-G76 GPU IP core [10]. These cores have been integrated into the Kirin SoC by Huawei, the Qualcomm Snapdragon SoC, the MediaTek Helio SoC, and the Samsung Exynos [116]. For example, on the Exynos 9825 Octa, 8-bit integer quantized MobileNet v2 can process an image in 19 ms (52 images per second) using the Mali-G76 [116].
Figure 17: Hardware platforms for neural network deployment efficiency, adopted from [202] (slide courtesy of Albert Reuther, MIT Lincoln Laboratory Supercomputing Center). The plot shows peak GOps/second versus peak power (W), with legend categories for computation precision (Int1 through Float64), form factor (chip, card, system), and computation type (inference or training).
Intel improved integer performance by about 33% with the Intel Advanced Vector Extension 512 (AVX-512) ISA [204]. This 512-bit SIMD ISA extension included a Fused Multiply-Add (FMA) instruction.

Low precision computation on nVidia GPUs has been enabled since the Pascal series of GPUs [184]. The Turing GPU architecture [188] introduced specialized units to process INT4 and INT8. This provides real-time integer performance on AI algorithms used in games. For embedded platforms, nVidia developed the Jetson platforms [187]. They use CUDA Maxwell cores [183] that can process half-precision types. For the data center, nVidia developed the extremely high performance DGX system [185]. It contains multiple high-end GPUs interconnected using nVidia's proprietary bus nVLINK. A DGX system can perform 4-bit integer to 32-bit floating point operations.

4.3.4. DNN Compilers
Heterogeneous neural network hardware accelerators are accelerating deep learning algorithm deployment [202]. Often exchange formats can be used to import/export models. Further, compilers have been developed to optimize models and generate code for specific processors. However, several challenges remain:

• Network Parsing: Developers design neural network models on different platforms using various frameworks and programming languages. However, they have common parts, such as convolution, activation, pooling, etc. Parsing tools analyze the model compositions and transfer them into a unified representation.

• Structure Optimization: The model may contain operations used in training that aren't required for inference. Tool-kits and compilers should optimize these structures (e.g. BN folding as discussed in Section 2.5).

• Intermediate Representation (IR): An optimized model should be properly stored for further deployment. Since the inference engine is uncertain, the stored IR should include the model architecture and the trained weights. A compiler can then read the model and optimize it for a specific inference engine.

• Compression: Compilers and optimizers should optionally be able to automatically compress arbitrary network structures using pruning and quantization.

• Deployment: The final optimized model should be mapped to the target engine(s), which may be heterogeneous.

Open Neural Network Exchange (ONNX) [190] is an open-source tool to parse AI models written for a variety of frameworks. It imports and exports models using an open-source format, facilitating the translation of neural network models between frameworks. It is thus capable of network parsing provided low-level operations are defined in all target frameworks.

TVM [36], Glow [205], OpenVINO [118], and MLIR [134] are deep learning compilers. They differ from frameworks such as Caffe in that they store intermediate representations and optimize them to map models onto specific hardware engines. They typically integrate both quantization-aware training and calibration-based post-training quantization. We summarize key features below. They perform all the operations noted in our list. A detailed survey can be found in [149].

TVM [36] leverages the efficiency of quantization by enabling deployment of quantized models from PyTorch and TF-Lite. As a compiler, TVM has the ability to map the model onto general hardware such as Intel's AVX and nVidia's CUDA.
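As a concrete example of the exchange-format step described above, the snippet below exports a (here untrained) torchvision model to ONNX so that a compiler or runtime can import it. The model choice, file name, input shape, and opset version are placeholders, not part of any particular toolchain discussed in this survey.

import torch
import torchvision

model = torchvision.models.mobilenet_v2().eval()     # any trained model would do
dummy = torch.randn(1, 3, 224, 224)                  # example input shape
torch.onnx.export(model, dummy, "mobilenet_v2.onnx",
                  opset_version=11,
                  input_names=["input"], output_names=["logits"])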
Glow [205] enables quantization with zero points and converts the data into 8-bit signed integers using a calibration-based method. Neither Glow nor TVM currently supports quantization-aware training, although both have announced future support for it [205].

MLIR [134] and OpenVINO [118] have sophisticated quantization support including quantization-aware training. OpenVINO integrates it in TensorFlow and PyTorch, while MLIR natively supports quantization-aware training. This allows users to fine-tune an optimized model when it doesn't satisfy accuracy criteria.

4.4. Quantization Reduces Over-fitting
In addition to accelerating neural networks, quantization has also been found in some cases to result in higher accuracy. As examples: 1) 3-bit-weight VGG-16 outperforms its full precision counterpart by 1.1% top-1 [144], 2) AlexNet reduces the reference top-1 error by 1.0% with 2-bit weights and 8-bit activations [66], 3) ResNet-34 with 4-bit weights and activations obtained 74.52% top-1 accuracy while the 32-bit version is 73.59% [174], 4) Zhou showed a quantized model reduced the classification error by 0.15%, 2.28%, 0.13%, 0.71%, and 1.59% on AlexNet, VGG-16, GoogLeNet, ResNet-18 and ResNet-50, respectively [269], and 5) Xu showed reduced-bit quantized networks help to reduce over-fitting on Fully Connected Networks (FCNs). By taking advantage of strict constraints in biomedical image segmentation they improved segmentation accuracy by 1% combined with a 6.4× memory usage reduction [251].

5. Summary
In this section we summarize the results of Pruning and Quantization.

5.1. Pruning
Section 3 shows pruning is an important technique for compressing neural networks. In this paper, we discussed pruning techniques categorized as 1) static pruning and 2) dynamic pruning. Previously, static pruning was the dominant area of research. Recently, dynamic pruning has become a focus because it can further improve performance even if static pruning has first been performed.

Pruning can be performed in multiple ways. Element-wise pruning improves weight compression and storage. Channel-wise and shape-wise pruning can be accelerated with specialized hardware and computation libraries. Filter-wise and layer-wise pruning can dramatically reduce computational complexity.

Though pruning sometimes introduces an incremental improvement in accuracy by escaping a local minimum [12], accuracy improvements are better realized by switching to a better network architecture [24]. For example, a separable block may provide better accuracy with reduced computational complexity [105]. Considering the evolution of network structures, performance may also be bottlenecked by the structure itself. From this point of view, Network Architecture Search and Knowledge Distillation can be options for further compression. Network pruning can be viewed as a subset of NAS but with a smaller search space. This is especially true when the pruned architecture no longer needs to use weights from the unpruned network (see Section 3.3). In addition, some NAS techniques can also be applied to the pruning approach, including borrowing trained coefficients and reinforcement learning search.

Typically, compression is evaluated on large datasets such as the ILSVRC-2012 dataset with one thousand object categories. In practice, resource constraints in embedded devices don't allow a large capacity of optimized networks. Compressing a model to best fit a constrained environment should consider, but not be limited to, the deployment environment, target device, speed/compression trade-offs, and accuracy requirements [29].

Based on the reviewed pruning techniques, we recommend the following for effective pruning:

• Uniform pruning introduces accuracy loss; setting the pruning ratio to vary by layer is therefore better [159].

• Dynamic pruning may result in higher accuracy and maintain higher network capacity [246].

• Structurally pruning a network may benefit from maturing libraries, especially when pruning at a high level [241].

• Training a pruned model from scratch sometimes, but not always (see Section 3.3), is more efficient than tuning from the unpruned weights [160].

• Penalty-based pruning typically reduces accuracy loss compared with magnitude-based pruning [255]. However, recent efforts are narrowing the gap [72].

5.2. Quantization
Section 4 discusses quantization techniques. It describes binarized neural networks and reduced precision networks, along with their training methods. We described low-bit dataset validation techniques and results. We also list the accuracy of popular quantization frameworks and described hardware implementations in Section 4.3.

Quantization usually results in a loss of accuracy due to information lost during the quantization process. This is particularly evident on compact networks. Most of the early low-bit quantization approaches only compare performance on small datasets (e.g., MNIST and CIFAR-10) [58, 94, 156, 200, 235, 269]. However, observations showed that some quantized networks could outperform the original network (see Section 4.4). Additionally, non-uniformly distributed data may lead to further deterioration in quantization performance [275]. Sometimes this can be ameliorated by normalization in fine-tuning [172] or by non-linear quantization (e.g., log representation) [175].

Advanced quantization techniques have improved accuracy. Asymmetric quantization [120] maintains higher dynamic range by using a zero point in addition to a regular
scale parameter. Overheads introduced by the zero point were minimized by pipelining the processing unit. Calibration-based quantization [173] removed zero points and replaced them with precise scales obtained from a calibration dataset. Quantization-aware training was shown to further improve quantization accuracy.

8-bit quantization is widely applied in practice as a good trade-off between accuracy and compression. It can easily be deployed on current processors and custom hardware. Minimal accuracy loss is experienced, especially when quantization-aware training is enabled. Binarized networks have also achieved reasonable accuracy with specialized hardware designs.

Though BN has advantages that help training and pruning, an issue with BN is that it may require a large dynamic range across a single layer kernel or between different channels. This may make layer-wise quantization more difficult. Because of this, per-channel quantization is recommended [131].

To achieve better accuracy following quantization, we recommend:

• Use asymmetrical quantization. It preserves flexibility over the quantization range even though it has computational overheads [120].

• Quantize the weights rather than the activations. Activations are more sensitive to numerical precision [75].

• Do not quantize biases. They do not require significant storage. High precision biases in all layers [114], and in the first/last layers [200, 272], maintain higher network accuracy.

• Quantize kernels channel-wise instead of layer-wise to significantly improve accuracy [131].

• Fine-tune the quantized model. It reduces the accuracy gap between the quantized model and the real-valued model [244].

• Initially train using a 32-bit floating point model. Low-bit quantized models can be difficult to train from scratch - especially compact models on large-scale datasets [272].

• The sensitivity of quantization is ordered as gradients, activations, and then weights [272].

• Stochastic quantization of gradients is necessary when training quantized models [89, 272].

6. Future Work
Although pruning and quantization algorithms help reduce the computation cost and bandwidth burden, there are still areas for improvement. In this section we highlight future work to further improve quantization and pruning.

Automatic Compression. Low bit-width quantization can cause significant accuracy loss, especially when the quantized bit-width is very narrow and the dataset is large [272, 155]. Automatic quantization is a technique to automatically search quantization encodings to evaluate accuracy loss versus compression ratio. Similarly, automatic pruning is a technique to automatically search different pruning approaches to evaluate the sparsity ratio versus accuracy. Similar to hyperparameter tuning [257], this can be performed without human intervention using any number of search techniques (e.g. random search, genetic search, etc.).

Compression on Other Types of Neural Networks. Current compression research is primarily focused on CNNs. More specifically, research is primarily directed towards CNN classification tasks. Future work should also consider other types of applications such as object detection, speech recognition, language translation, etc. Network compression versus accuracy for different applications is an interesting area of research.

Hardware Adaptation. Hardware implementations may limit the effectiveness of pruning algorithms. For example, element-wise pruning only slightly reduces computations or bandwidth when using im2col-gemm on GPUs [264]. Similarly, shape-wise pruning is not typically able to be implemented on dedicated CNN accelerators. Hardware-software co-design of compression techniques for hardware accelerators should be considered to achieve the best system efficiency.

Global Methods. Network optimizations are typically applied separately, without information from one optimization informing any other optimization. Recently, approaches that consider optimization effectiveness at multiple layers have been proposed. [150] discusses pruning combined with tensor factorization that results in better overall compression. Similar techniques can be considered using different types and levels of compression and factorization.

7. Conclusions
Deep neural networks have been applied in many applications exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy. In some cases accuracy may even improve.

Pruning can be categorized as static (Section 3.1) if it is performed offline or dynamic (Section 3.2) if it is performed at run-time. The criteria applied to removing redundant computations is often just a simple magnitude of weights, with values near zero being pruned. More complicated methods include checking the 𝑙𝑝-norm. Techniques such as LASSO and Ridge are built around 𝑙1 and 𝑙2 norms. Pruning can be performed element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise. Each has trade-offs in compression, accuracy, and speedup.

Quantization reduces computations by reducing the precision of the datatype. Most networks are trained using 32-bit floating point. Weights, biases, and activations may then be
Model Deployment W A Top-1 Top-5 Ref. Model Deployment W A Top-1 Top-5 Ref.
QIL 4 4 0.00% - [127] SYQ 1 8 5.40% 3.40% [66]
QIL 5 5 0.00% - [127] DoReFa-Net 4 4 5.50% 3.30% [272]
WRPN-2x 4 2 0.01% - [174] DoReFa-Net 5 5 5.50% -0.20% [272]
WRPN-2x 2 4 0.09% - [174] FGQ 2 8 5.60% - [170]
WRPN-2x 2 2 0.27% - [174] ABC-Net 5 5 6.30% 3.50% [155]
SeerNet 4 1 0.35% 0.17% [32] FGQ-TWN 2 4 6.67% - [170]
Unified INT8 8 8 0.39% - [275] HWGQ 1 2 6.90% 4.60% [31]
LCCL 0.43% 0.17% [59] ResNet-100 IAO 8 8 1.40% - [120]
QIL 3 3 0.60% - [127] ResNet-101 TensorRT 8 8 -0.01% 0.05% [173]
WRPN-3x 1 1 0.90% - [174] FGQ-TWN 2 8 3.65% - [170]
WRPN-3x 1 1 1.21% - [174] FGQ-TWN 2 4 6.81% - [170]
GroupNet-8 1 1 1.40% 1.00% [276] ResNet-150 IAO 8 8 2.10% - [120]
dLAC 2 16 1.67% 0.89% [235] ResNet-152 TensorRT 8 8 0.08% 0.04% [173]
LQ-NETs 3 3 1.90% 1.20% [262] dLAC 2 16 1.20% 0.64% [235]
GroupNet**-5 1 1 2.70% 2.10% [276]
IR-Net 1 32 2.90% 1.80% [199] SqueezeNet AngleEye 16 16 0.00% 0.01% [85]
QIL 2 2 3.10% - [127] ShiftCNN 3 4 0.01% 0.01% [84]
WRPN-2x 1 1 3.40% - [174] ShiftCNN 2 4 1.01% 0.71% [84]
WRPN-2x 1 1 3.74% - [174] AngleEye 8 8 1.42% 1.05% [85]
LQ-NETs 2 2 4.00% 2.30% [262] AngleEye 6 6 28.13% 27.43% [85]
GroupNet-5 1 1 4.70% 3.40% [276] ShiftCNN 1 4 35.39% 35.09% [84]
ABC-Net 5 5 4.90% 3.10% [155] VGG-16 ELNN 3(±4) 32 -1.10% -1.00% [144]
HWGQ 1 32 5.10% 3.40% [31] ELNN 3(±2) 32 -0.60% -0.80% [144]
WAGEUBN 8 8 5.18% - [254] AngleEye 16 16 0.09% -0.05% [85]
ABC-Net 3 3 6.60% 3.90% [155] DFP16 16 16 0.11% 0.29% [54]
LQ-NETs 1 2 6.70% 4.40% [262] AngleEye 8 8 0.21% 0.08% [85]
LQ-NETs 4 4 6.70% 4.40% [262] SeerNet 4 1 0.28% 0.10% [32]
BCGD 1 4 7.60% 4.70% [256] DeepShift-Q 6 32 0.29% 0.11% [62]
HWGQ 1 2 9.00% 5.60% [31] FFN 2 32 0.30% -0.20% [238]
IR-Net 1 1 9.50% 6.20% [199] DeepShift-PS 6 32 0.47% 0.30% [62]
CI-BCNN (add) 1 1 11.07% 6.39% [240] DeepShift-Q 6 32 0.72% 0.29% [62]
Bi-Real 1 1 11.10% 7.40% [252] INQ 5 32 0.77% 0.08% [62]
WRPN-1x 1 1 12.80% - [174] TWN 2 32 1.10% 0.30% [146]
WRPN 1 1 13.05% - [174] ELNN 2 32 2.00% 0.90% [144]
CI-BCNN 1 1 13.59% 8.65% [240] TSQ 2 2 2.00% 0.70% [239]
DoReFa-Net 1 4 14.60% - [272] AngleEye 16 16 2.15% 1.49% [85]
DoReFa-Net 1 2 20.40% - [272] BWN 2 32 2.20% 1.20% [200]
ABC-Net 1 1 20.90% 14.80% [155] AngleEye 8 8 2.35% 1.76% [85]
BNN 1 1 29.10% 24.20% [272] ELNN 1 32 3.30% 1.80% [144]
ResNet-50 Mixed-Precision 16 16 -0.12% - [172] AngleEye 6 6 9.07% 6.58% [85]
DFP16 16 16 -0.07% -0.06% [54] AngleEye 6 6 22.38% 17.75% [85]
QuantNet 5 32 0.00% 0.00% [253] LogQuant 3 3 - 0.99% [30]
LQ-NETs 4 32 0.00% 0.10% [262] LogQuant 4 4 - 0.51% [30]
FGQ 32 32 0.00% - [170] LogQuant 6 6 - 0.83% [30]
TensorRT 8 8 0.13% 0.12% [173] LogQuant 32 3 - 0.82% [30]
PACT 5 5 0.20% -0.20% [44] LogQuant 32 4 - 0.36% [30]
QuantNet 3(±4) 32 0.20% 0.00% [253] LogQuant 32 6 - 0.31% [30]
Unified INT8 8 8 0.26% - [275] LogQuant 6 32 - 0.76% [30]
ShiftCNN 3 4 0.29% 0.15% [84] LDR 5 4 - 0.90% [175]
ShiftCNN 3 4 0.31% 0.16% [84] LogNN 5 4 - 1.38% [175]
PACT 4 4 0.40% -0.10% [44]
LPBN 32 5 0.40% 0.40% [81]
ShiftCNN 2 4 0.67% 0.41% [84]
DeepShift-Q 6 32 0.81% 0.21% [62]
DeepShift-PS 6 32 0.84% 0.31% [62]
PACT 5 32 0.90% 0.20% [44]
QuantNet 3(±2) 32 0.90% 0.40% [253]
PACT 4 32 1.00% 0.20% [44]
dLAC 2 16 1.20% - [235]
QuantNet 2 32 1.20% 0.60% [253]
AddNN 32 32 1.30% 1.20% [35]
LQ-NETs 4 4 1.30% 0.80% [262]
LQ-NETs 2 32 1.30% 0.90% [262]
INQ 5 32 1.32% 0.41% [269]
PACT 3 32 1.40% 0.50% [44]
IAO 8 8 1.50% - [120]
PACT 3 3 1.60% 0.50% [44]
HAQ 2MP 4MP 1.91% - [236]
HAQ MP MP 2.09% - [236]
LQ-NETs 3 3 2.20% 1.60% [262]
LPBN 32 4 2.20% 1.20% [81]
Deep Comp. 3 MP 2.29% - [92]
PACT 4 2 2.40% 1.20% [44]
ShiftCNN 2 4 2.49% 1.64% [84]
FFN 2 32 2.50% 1.30% [238]
UNIQ 4 8 2.60% - [18]
QuantNet 1 32 3.20% 1.70% [253]
SYQ 2 8 3.70% 2.10% [66]
FGQ-TWN 2 8 4.29% - [170]
PACT 2 2 4.70% 2.60% [44]
LQ-NETs 2 2 4.90% 2.90% [262]
[34] Chellapilla, K., Puri, S., Simard, P., 2006. High Performance Con- Artificial Intelligence Review 53, 5113–5155. URL: https://doi.
volutional Neural Networks for Document Processing, in: Tenth org/10.1007/s10462-020-09816-7, doi:10.1007/s10462-020-09816-7.
International Workshop on Frontiers in Handwriting Recognition. [49] Cornea, M., 2015. Intel ® AVX-512 Instructions and Their Use in
URL: https://hal.inria.fr/inria-00112631/, doi:10.1.1.137.482. the Implementation of Math Functions. Intel Corporation .
[35] Chen, H., Wang, Y., Xu, C., Shi, B., Xu, C., Tian, Q., Xu, C., 2020. [50] Cotofana, S., Vassiliadis, S., Logic, T., Addition, B., Addition, S.,
AdderNet: Do We Really Need Multiplications in Deep Learning?, in: 1997. Low Weight and Fan-In Neural Networks for Basic Arithmetic
Proceedings of the IEEE/CVF Conference on Computer Vision and Operations, in: 15th IMACS World Congress, pp. 227–232. doi:10.
Pattern Recognition (CVPR), pp. 1468–1477. URL: http://arxiv. 1.1.50.4450.
org/abs/1912.13200. [51] Courbariaux, M., Bengio, Y., David, J.P., 2014. Training deep neu-
[36] Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan, M., ral networks with low precision multiplications, in: International
Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., Krishnamurthy, Conference on Learning Representations(ICLR), pp. 1–10. URL:
A., 2018. TVM: An automated end-to-end optimizing compiler for http://arxiv.org/abs/1412.7024, doi:arXiv:1412.7024.
deep learning, in: Proceedings of the 13th USENIX Symposium [52] Courbariaux, M., Bengio, Y., David, J.P., 2015. BinaryConnect:
on Operating Systems Design and Implementation, OSDI 2018, pp. Training Deep Neural Networks with binary weights during propa-
579–594. URL: http://arxiv.org/abs/1802.04799. gations, in: Advances in Neural Information Processing Systems
[37] Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y., 2015. (NIPS), pp. 1–9. URL: http://arxiv.org/abs/1511.00363, doi:10.
Compressing neural networks with the hashing trick., in: In Inter- 5555/2969442.2969588.
national Conference on Machine Learning, pp. 2285–2294. URL: [53] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio,
http://arxiv.org/abs/1504.04788. Y., 2016. Binarized Neural Networks: Training Deep Neural
[38] Chen, Y., Chen, T., Xu, Z., Sun, N., Temam, O., 2016. DianNao Networks with Weights and Activations Constrained to +1 or -1.
family: Energy-Efficient Hardware Accelerators for Machine ArXiv preprint URL: https://github.com/MatthieuCourbariaux/http:
Learning. Communications of the ACM 59, 105–112. URL: //arxiv.org/abs/1602.02830.
10.1145/2594446%5Cnhttps://ejwl.idm.oclc.org/login?url=http: [54] Das, D., Mellempudi, N., Mudigere, D., Kalamkar, D., Avancha, S.,
//search.ebscohost.com/login.aspx?direct=true&db=bth&AN= Banerjee, K., Sridharan, S., Vaidyanathan, K., Kaul, B., Georganas,
95797996&site=ehost-livehttp://dl.acm.org/citation.cfm?doid= E., Heinecke, A., Dubey, P., Corbal, J., Shustrov, N., Dubtsov,
3013530.2996864, doi:10.1145/2996864. R., Fomenko, E., Pirogov, V., 2018. Mixed Precision Training of
[39] Cheng, J., Wang, P.s., Li, G., Hu, Q.h., Lu, H.q., 2018. Recent ad- Convolutional Neural Networks using Integer Operations, in: In-
vances in efficient computation of deep convolutional neural networks. ternational Conference on Learning Representations(ICLR),
Frontiers of Information Technology & Electronic Engineering pp. 1–11. URL: https://www.anandtech.com/show/11741/
19, 64–77. URL: http://link.springer.com/10.1631/FITEE.1700789, hot-chips-intel-knights-mill-live-blog-445pm-pt-1145pm-utchttp:
doi:10.1631/FITEE.1700789. //arxiv.org/abs/1802.00930.
[40] Cheng, Y., Wang, D., Zhou, P., Zhang, T., 2017. A Survey of Model [55] Dash, M., Liu, H., 1997. Feature selection for classification. Intelli-
Compression and Acceleration for Deep Neural Networks. ArXiv gent Data Analysis 1, 131–156. doi:10.3233/IDA-1997-1302.
preprint URL: http://arxiv.org/abs/1710.09282. [56] Davis, A., Arel, I., 2013. Low-Rank Approximations for Conditional
[41] Cheng, Z., Soudry, D., Mao, Z., Lan, Z., 2015. Training Binary Mul- Feedforward Computation in Deep Neural Networks, in: International
tilayer Neural Networks for Image Classification using Expectation Conference on Learning Representations Workshops (ICLRW), pp.
Backpropagation. ArXiv preprint URL: http://cn.arxiv.org/pdf/ 1–10. URL: http://arxiv.org/abs/1312.4461.
1503.03562.pdfhttp://arxiv.org/abs/1503.03562. [57] Deng, W., Yin, W., Zhang, Y., 2013. Group sparse optimiza-
[42] Chiliang, Z., Tao, H., Yingda, G., Zuochang, Y., 2019. Accelerating tion by alternating direction method, in: Van De Ville, D.,
Convolutional Neural Networks with Dynamic Channel Pruning, in: Goyal, V.K., Papadakis, M. (Eds.), Wavelets and Sparsity XV,
2019 Data Compression Conference (DCC), IEEE. pp. 563–563. p. 88580R. URL: http://proceedings.spiedigitallibrary.org/
URL: https://ieeexplore.ieee.org/document/8712710/, doi:10.1109/ proceeding.aspx?doi=10.1117/12.2024410, doi:10.1117/12.2024410.
DCC.2019.00075. [58] Dettmers, T., 2015. 8-Bit Approximations for Parallelism
[43] Choi, B., Lee, J.H., Kim, D.H., 2008. Solving local minima in Deep Learning, in: International Conference on Learn-
problem with large number of hidden nodes on two-layered feed- ing Representations(ICLR). URL: https://github.com/soumith/
forward artificial neural networks. Neurocomputing 71, 3640–3643. convnet-benchmarkshttp://arxiv.org/abs/1511.04561.
doi:10.1016/j.neucom.2008.04.004. [59] Dong, X., Huang, J., Yang, Y., Yan, S., 2017. More is less: A more
[44] Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.j., Srinivasan, V., complicated network with less inference complexity. Proceedings -
Gopalakrishnan, K., 2018. PACT: Parameterized Clipping Activation 30th IEEE Conference on Computer Vision and Pattern Recognition,
for Quantized Neural Networks. ArXiv preprint , 1–15URL: http: CVPR 2017 2017-Janua, 1895–1903. URL: http://arxiv.org/abs/
//arxiv.org/abs/1805.06085. 1703.08651, doi:10.1109/CVPR.2017.205.
[45] Choi, Y., El-Khamy, M., Lee, J., 2017a. Towards the Limit of Network [60] Dongarra, J.J., Du Croz, J., Hammarling, S., Duff, I.S., 1990. A set
Quantization, in: International Conference on Learning Represen- of level 3 basic linear algebra subprograms. ACM Transactions on
tations(ICLR), IEEE. URL: https://arxiv.org/abs/1612.01543http: Mathematical Software (TOMS) 16, 1–17. doi:10.1145/77626.79170.
//arxiv.org/abs/1612.01543. [61] Dukhan, M., Yiming, W., Hao, L., Lu, H., 2019. QNNPACK:
[46] Choi, Y., Member, S.S., Bae, D., Sim, J., Member, S.S., Choi, S., Open source library for optimized mobile deep learning - Facebook
Kim, M., Member, S.S., Kim, L.s.S., Member, S.S., 2017b. Energy- Engineering. URL: https://engineering.fb.com/ml-applications/
Efficient Design of Processing Element for Convolutional Neural qnnpack/.
Network. IEEE Transactions on Circuits and Systems II: Express [62] Elhoushi, M., Chen, Z., Shafiq, F., Tian, Y.H., Li, J.Y., 2019.
Briefs 64, 1332–1336. URL: http://ieeexplore.ieee.org/document/ DeepShift: Towards Multiplication-Less Neural Networks. ArXiv
7893765/, doi:10.1109/TCSII.2017.2691771. preprint URL: http://arxiv.org/abs/1905.13298.
[47] Chollet, F., Google, C., 2017. Xception : Deep Learning with [63] Elsken, T., Metzen, J.H., Hutter, F., 2019. Neural Architec-
Depthwise Separable Convolutions, in: The IEEE Conference on ture Search. Journal of Machine Learning Research 20, 63–
Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 1251– 77. URL: http://link.springer.com/10.1007/978-3-030-05318-5_3,
1258. URL: http://ieeexplore.ieee.org/document/8099678/, doi:10. doi:10.1007/978-3-030-05318-5{\_}3.
1109/CVPR.2017.195. [64] Engelbrecht, A.P., 2001. A new pruning heuristic based on variance
[48] Choudhary, T., Mishra, V., Goswami, A., Sarangapani, J., 2020. analysis of sensitivity information. IEEE Transactions on Neural
A comprehensive survey on model compression and acceleration. Networks 12, 1386–1389. doi:10.1109/72.963775.
[65] Esser, S.K., Merolla, P.A., Arthur, J.V., Cassidy, A.S., Appuswamy, R., Andreopoulos, A., Berg, D.J., McKinstry, J.L., Melano, T., Barch, D.R., di Nolfo, C., Datta, P., Amir, A., Taba, B., Flickner, M.D., Modha, D.S., 2016. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences 113, 11441–11446. URL: http://www.pnas.org/lookup/doi/10.1073/pnas.1604850113, doi:10.1073/pnas.1604850113.
[66] Faraone, J., Fraser, N., Blott, M., Leong, P.H., 2018. SYQ: Learning Symmetric Quantization for Efficient Deep Neural Networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[67] Fiesler, E., Choudry, A., Caulfield, H.J., 1990. Weight discretization paradigm for optical neural networks. Optical Interconnections and Networks 1281, 164. doi:10.1117/12.20700.
[68] Figurnov, M., Collins, M.D., Zhu, Y., Zhang, L., Huang, J., Vetrov, D., Salakhutdinov, R., 2017. Spatially Adaptive Computation Time for Residual Networks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 1790–1799. URL: http://ieeexplore.ieee.org/document/8099677/, doi:10.1109/CVPR.2017.194.
[69] FPGA, I., . Intel® FPGA Development Tools - Intel FPGA. URL: https://www.intel.com/content/www/us/en/software/programmable/overview.html.
[70] Frankle, J., Carbin, M., 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1803.03635.
[71] Fukushima, K., 1988. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks 1, 119–130. doi:10.1016/0893-6080(88)90014-7.
[72] Gale, T., Elsen, E., Hooker, S., 2019. The State of Sparsity in Deep Neural Networks. ArXiv preprint. URL: http://arxiv.org/abs/1902.09574.
[73] Gao, X., Zhao, Y., Dudziak, L., Mullins, R., Xu, C.Z., 2019. Dynamic Channel Pruning: Feature Boosting and Suppression, in: International Conference on Learning Representations (ICLR), pp. 1–14. URL: http://arxiv.org/abs/1810.05331.
[74] Glossner, J., Blinzer, P., Takala, J., 2016. HSA-enabled DSPs and accelerators. 2015 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2015, 1407–1411. doi:10.1109/GlobalSIP.2015.7418430.
[75] Gong, R., Liu, X., Jiang, S., Li, T., Hu, P., Lin, J., Yu, F., Yan, J., 2019. Differentiable soft quantization: Bridging full-precision and low-bit neural networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4851–4860. doi:10.1109/ICCV.2019.00495.
[76] Gong, Y., Liu, L., Yang, M., Bourdev, L., 2014. Compressing Deep Convolutional Networks using Vector Quantization, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1412.6115.
[77] Google, . Hosted models | TensorFlow Lite. URL: https://www.tensorflow.org/lite/guide/hosted_models.
[78] Google, 2018. google/gemmlowp: Low-precision matrix multiplication. URL: https://github.com/google/gemmlowp.
[79] Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.J., Choi, E., 2018. MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 1586–1595. URL: https://ieeexplore.ieee.org/document/8578269/, doi:10.1109/CVPR.2018.00171.
[80] Gou, J., Yu, B., Maybank, S.J., Tao, D., 2020. Knowledge Distillation: A Survey. ArXiv preprint. URL: http://arxiv.org/abs/2006.05525.
[81] Graham, B., 2017. Low-Precision Batch-Normalized Activations. ArXiv preprint, 1–16. URL: http://arxiv.org/abs/1702.08231.
[82] Graves, A., 2016. Adaptive Computation Time for Recurrent Neural Networks. ArXiv preprint, 1–19. URL: http://arxiv.org/abs/1603.08983.
[83] Greff, K., Srivastava, R.K., Schmidhuber, J., 2016. Highway and Residual Networks learn Unrolled Iterative Estimation, in: International Conference on Learning Representations (ICLR), pp. 1–14. URL: http://arxiv.org/abs/1612.07771.
[84] Gudovskiy, D.A., Rigazio, L., 2017. ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks. ArXiv preprint. URL: http://arxiv.org/abs/1706.02393.
[85] Guo, K., Sui, L., Qiu, J., Yu, J., Wang, J., Yao, S., Han, S., Wang, Y., Yang, H., 2018. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 35–47. URL: https://ieeexplore.ieee.org/abstract/document/7930521/, doi:10.1109/TCAD.2017.2705069.
[86] Guo, K., Zeng, S., Yu, J., Wang, Y., Yang, H., 2017. A Survey of FPGA-Based Neural Network Accelerator. ACM Transactions on Reconfigurable Technology and Systems 9. URL: http://arxiv.org/abs/1712.08934.
[87] Guo, Y., 2018. A Survey on Methods and Theories of Quantized Neural Networks. ArXiv preprint. URL: http://arxiv.org/abs/1808.04752.
[88] Guo, Y., Yao, A., Chen, Y., 2016. Dynamic Network Surgery for Efficient DNNs, in: Advances in Neural Information Processing Systems (NIPS), pp. 1379–1387. URL: http://papers.nips.cc/paper/6165-dynamic-network-surgery-for-efficient-dnns.
[89] Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P., 2015. Deep learning with limited numerical precision, in: International Conference on Machine Learning (ICML), pp. 1737–1746.
[90] Gysel, P., Pimentel, J., Motamedi, M., Ghiasi, S., 2018. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 29, 1–6. URL: https://ieeexplore.ieee.org/abstract/document/8318896/, doi:10.1109/TNNLS.2018.2808319.
[91] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J., 2016a. EIE: Efficient Inference Engine on Compressed Deep Neural Network, in: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), IEEE. pp. 243–254. URL: http://ieeexplore.ieee.org/document/7551397/, doi:10.1109/ISCA.2016.30.
[92] Han, S., Mao, H., Dally, W.J., 2016b. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, in: International Conference on Learning Representations (ICLR), pp. 199–203. URL: http://arxiv.org/abs/1510.00149.
[93] Han, S., Pool, J., Narang, S., Mao, H., Gong, E., Tang, S., Elsen, E., Vajda, P., Paluri, M., Tran, J., Catanzaro, B., Dally, W.J., 2016c. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1607.04381.
[94] Han, S., Pool, J., Tran, J., Dally, W.J., 2015. Learning both Weights and Connections for Efficient Neural Networks, in: Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143. URL: http://arxiv.org/abs/1506.02626.
[95] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., Ng, A.Y., 2014. Deep Speech: Scaling up end-to-end speech recognition. ArXiv preprint, 1–12. URL: http://arxiv.org/abs/1412.5567.
[96] Hanson, S., 1989. Comparing biases for minimal network construction with back-propagation, in: Advances in Neural Information Processing Systems (NIPS), pp. 177–185.
[97] Hassibi, B., Stork, D.G., Wolff, G.J., 1993. Optimal brain surgeon and general network pruning. doi:10.1109/icnn.1993.298572.
[98] He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep Residual Learning for Image Recognition, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 171–180. URL: http://arxiv.org/abs/1512.03385.
[99] He, Y., Kang, G., Dong, X., Fu, Y., Yang, Y., 2018. Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), International Joint Conferences on Artificial Intelligence Organization, California. pp. 2234–2240. URL: http://arxiv.org/abs/1808.06866, doi:10.24963/ijcai.2018/309.
[100] He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y., 2019. Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). URL: http://arxiv.org/abs/1811.00250.
[101] He, Y., Zhang, X., Sun, J., 2017. Channel Pruning for Accelerating Very Deep Neural Networks, in: IEEE International Conference on Computer Vision (ICCV), IEEE. pp. 1398–1406. URL: http://ieeexplore.ieee.org/document/8237417/, doi:10.1109/ICCV.2017.155.
[102] Hinton, G., 2012. Neural networks for machine learning. Technical Report. Coursera.
[103] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R., 2012. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv preprint, 1–18. URL: http://arxiv.org/abs/1207.0580.
[104] Hou, L., Yao, Q., Kwok, J.T., 2017. Loss-aware Binarization of Deep Networks, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1611.01600.
[105] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. ArXiv preprint. URL: http://arxiv.org/abs/1704.04861.
[106] Hu, H., Peng, R., Tai, Y.W., Tang, C.K., 2016. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures. ArXiv preprint. URL: http://arxiv.org/abs/1607.03250.
[107] Hu, Q., Wang, P., Cheng, J., 2018. From hashing to CNNs: Training binary weight networks via hashing, in: AAAI Conference on Artificial Intelligence, pp. 3247–3254.
[108] Huang, G., Chen, D., Li, T., Wu, F., Van Der Maaten, L., Weinberger, K., 2018. Multi-scale dense networks for resource efficient image classification, in: International Conference on Learning Representations (ICLR). URL: http://image-net.org/challenges/talks/.
[109] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Densely Connected Convolutional Networks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 2261–2269. URL: https://ieeexplore.ieee.org/document/8099726/, doi:10.1109/CVPR.2017.243.
[110] Huang, G.B., Learned-Miller, E., 2014. Labeled faces in the wild: Updates and new reporting procedures. Dept. Comput. Sci., Univ. Massachusetts Amherst, Amherst, MA, USA, Tech. Rep 14, 1–5.
[111] Huang, Z., Wang, N., 2018. Data-Driven Sparse Structure Selection for Deep Neural Networks, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). volume 11220 LNCS, pp. 317–334. URL: http://link.springer.com/10.1007/978-3-030-01270-0_19, doi:10.1007/978-3-030-01270-0_19.
[112] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y., 2016a. Binarized Neural Networks, in: Advances in Neural Information Processing Systems (NIPS), pp. 4114–4122. URL: http://papers.nips.cc/paper/6573-binarized-neural-networks.
[113] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y., 2016b. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. Journal of Machine Learning Research 18. URL: http://arxiv.org/abs/1609.07061.
[114] Hwang, K., Sung, W., 2014. Fixed-point feedforward deep neural network design using weights +1, 0, and -1, in: 2014 IEEE Workshop on Signal Processing Systems (SiPS), IEEE. pp. 1–6. URL: https://ieeexplore.ieee.org/abstract/document/6986082/, doi:10.1109/SiPS.2014.6986082.
[115] Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K., 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, in: ArXiv e-prints. URL: https://arxiv.org/abs/1602.07360.
[116] Ignatov, A., Timofte, R., Kulik, A., Yang, S., Wang, K., Baum, F., Wu, M., Xu, L., Van Gool, L., 2019. AI benchmark: All about deep learning on smartphones in 2019. Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019, 3617–3635. doi:10.1109/ICCVW.2019.00447.
[117] Imagination, . PowerVR - embedded graphics processors powering iconic products. URL: https://www.imgtec.com/graphics-processors/.
[118] Intel, . OpenVINO™ Toolkit. URL: https://docs.openvinotoolkit.org/latest/index.html.
[119] Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning (ICML), pp. 448–456. URL: http://arxiv.org/abs/1502.03167.
[120] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko, D., 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 2704–2713. URL: https://ieeexplore.ieee.org/document/8578384/, doi:10.1109/CVPR.2018.00286.
[121] Jia, Z., Tillman, B., Maggioni, M., Scarpazza, D.P., 2019. Dissecting the graphcore IPU architecture via microbenchmarking. ArXiv preprint.
[122] Jia Deng, Wei Dong, Socher, R., Li-Jia Li, Kai Li, Li Fei-Fei, 2009. ImageNet: A large-scale hierarchical image database. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 248–255. doi:10.1109/cvprw.2009.5206848.
[123] Jianchang Mao, Mohiuddin, K., Jain, A., 1994. Parsimonious network design and feature selection through node pruning, in: Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3 - Conference C: Signal Processing (Cat. No.94CH3440-5), IEEE Comput. Soc. Press. pp. 622–624. URL: http://ieeexplore.ieee.org/document/577060/, doi:10.1109/icpr.1994.577060.
[124] Jiao, Y., Han, L., Long, X., 2020. Hanguang 800 NPU – The Ultimate AI Inference Solution for Data Centers, in: 2020 IEEE Hot Chips 32 Symposium (HCS), IEEE. pp. 1–29. URL: https://ieeexplore.ieee.org/document/9220619/, doi:10.1109/HCS49909.2020.9220619.
[125] Jouppi, N.P., Borchers, A., Boyle, R., Cantin, P.L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Young, C., Ghaemmaghami, T.V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C.R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Patil, N., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Patterson, D., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Agrawal, G., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Bajwa, R., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Bates, S., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., Yoon, D.H., Bhatia, S., Boden, N., 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. ACM SIGARCH Computer Architecture News 45, 1–12. URL: http://dl.acm.org/citation.cfm?doid=3140659.3080246, doi:10.1145/3140659.3080246.
[126] Judd, P., Delmas, A., Sharify, S., Moshovos, A., 2017. Cnvlutin2: Ineffectual-Activation-and-Weight-Free Deep Neural Network Computing. ArXiv preprint, 1–6. URL: https://arxiv.org/abs/1705.00125.
[127] Jung, S., Son, C., Lee, S., Son, J., Kwak, Y., Han, J.J., Hwang, S.J., Choi, C., 2018. Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss. URL: http://arxiv.org/abs/1808.05779.
[128] Kathail, V., 2020. Xilinx Vitis Unified Software Platform, in: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, New York, NY, USA. pp. 173–174. URL: https://dl.acm.org/doi/10.1145/3373087.3375887, doi:10.1145/3373087.3375887.
[129] Keil, 2018. CMSIS NN Software Library. URL: https://arm-software.github.io/CMSIS_5/NN/html/index.html.
[130] Köster, U., Webb, T.J., Wang, X., Nassar, M., Bansal, A.K., Constable, W.H., Elibol, O.H., Gray, S., Hall, S., Hornof, L., Khosrowshahi, A., Kloss, C., Pai, R.J., Rao, N., 2017. Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks. ArXiv preprint. URL: http://arxiv.org/abs/1711.02213.
[131] Krishnamoorthi, R., 2018. Quantizing deep convolutional networks for efficient inference: A whitepaper. ArXiv preprint. URL: http://arxiv.org/abs/1806.08342.
[132] Krizhevsky, A., 2009. Learning Multiple Layers of Features from Tiny Images. Science Department, University of Toronto, Tech. Rep.
[133] Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet Classification with Deep Convolutional Neural Networks, in: Advances in Neural Information Processing Systems (NIPS), pp. 1–9. URL: http://code.google.com/p/cuda-convnet/.
[134] Lattner, C., Amini, M., Bondhugula, U., Cohen, A., Davis, A., Pienaar, J., Riddle, R., Shpeisman, T., Vasilache, N., Zinenko, O., 2020. MLIR: A Compiler Infrastructure for the End of Moore's Law. ArXiv preprint. URL: http://arxiv.org/abs/2002.11054.
[135] Lavin, A., Gray, S., 2016. Fast Algorithms for Convolutional Neural Networks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 4013–4021. URL: http://ieeexplore.ieee.org/document/7780804/, doi:10.1109/CVPR.2016.435.
[136] Lebedev, V., Lempitsky, V., 2016. Fast ConvNets Using Group-Wise Brain Damage, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 2554–2564. URL: http://ieeexplore.ieee.org/document/7780649/, doi:10.1109/CVPR.2016.280.
[137] Lebedev, V., Lempitsky, V., 2018. Speeding-up convolutional neural networks: A survey. Bulletin of the Polish Academy of Sciences: Technical Sciences 66, 799–810. URL: http://www.czasopisma.pan.pl/Content/109869/PDF/05_799-810_00925_Bpast.No.66-6_31.12.18_K2.pdf, doi:10.24425/bpas.2018.125927.
[138] Lecun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. doi:10.1038/nature14539.
[139] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2323. URL: http://ieeexplore.ieee.org/document/726791/, doi:10.1109/5.726791.
[140] LeCun, Y., Denker, J.S., Solla, S.A., 1990. Optimal Brain Damage, in: Advances in Neural Information Processing Systems (NIPS), pp. 598–605. doi:10.5555/109230.109298.
[141] Lee, N., Ajanthan, T., Torr, P.H., 2019. SNIP: Single-shot network pruning based on connection sensitivity, in: International Conference on Learning Representations (ICLR).
[142] Lei, J., Gao, X., Song, J., Wang, X.L., Song, M.L., 2018. Survey of Deep Neural Network Model Compression. Ruan Jian Xue Bao/Journal of Software 29, 251–266. doi:10.13328/j.cnki.jos.005428.
[143] Lei, W., Chen, H., Wu, Y., 2017. Compressing Deep Convolutional Networks Using K-means Based on Weights Distribution, in: Proceedings of the 2nd International Conference on Intelligent Information Processing - IIP'17, ACM Press, New York, New York, USA. pp. 1–6. URL: http://dl.acm.org/citation.cfm?doid=3144789.3144803, doi:10.1145/3144789.3144803.
[144] Leng, C., Li, H., Zhu, S., Jin, R., 2018. Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM. The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). URL: http://arxiv.org/abs/1707.09870.
[145] Leroux, S., Bohez, S., De Coninck, E., Verbelen, T., Vankeirsbilck, B., Simoens, P., Dhoedt, B., 2017. The cascading neural network: building the Internet of Smart Things. Knowledge and Information Systems 52, 791–814. URL: http://link.springer.com/10.1007/s10115-017-1029-1, doi:10.1007/s10115-017-1029-1.
[146] Li, F., Zhang, B., Liu, B., 2016. Ternary Weight Networks, in: Advances in Neural Information Processing Systems (NIPS). URL: http://arxiv.org/abs/1605.04711.
[147] Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P., 2017a. Pruning Filters for Efficient ConvNets, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1608.08710.
[148] Li, H., Zhang, H., Qi, X., Ruigang, Y., Huang, G., 2019. Improved Techniques for Training Adaptive Deep Networks, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE. pp. 1891–1900. URL: https://ieeexplore.ieee.org/document/9010043/, doi:10.1109/ICCV.2019.00198.
[149] Li, M., Liu, Y., Liu, X., Sun, Q., You, X., Yang, H., Luan, Z., Gan, L., Yang, G., Qian, D., 2020a. The Deep Learning Compiler: A Comprehensive Survey. ArXiv preprint, 1–36. URL: http://arxiv.org/abs/2002.03794.
[150] Li, Y., Gu, S., Mayer, C., Van Gool, L., Timofte, R., 2020b. Group Sparsity: The Hinge Between Filter Pruning and Decomposition for Network Compression, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 8015–8024. URL: https://ieeexplore.ieee.org/document/9157445/, doi:10.1109/CVPR42600.2020.00804.
[151] Li, Z., Wang, Y., Zhi, T., Chen, T., 2017b. A survey of neural network accelerators. Frontiers of Computer Science 11, 746–761. URL: http://link.springer.com/10.1007/s11704-016-6159-1, doi:10.1007/s11704-016-6159-1.
[152] Li, Z., Zhang, Y., Wang, J., Lai, J., 2020c. A survey of FPGA design for AI era. Journal of Semiconductors 41. doi:10.1088/1674-4926/41/2/021402.
[153] Lin, J., Rao, Y., Lu, J., Zhou, J., 2017a. Runtime Neural Pruning, in: Advances in Neural Information Processing Systems (NIPS), pp. 2178–2188. URL: https://papers.nips.cc/paper/6813-runtime-neural-pruning.pdf.
[154] Lin, M., Chen, Q., Yan, S., 2014. Network in network, in: International Conference on Learning Representations (ICLR), pp. 1–10.
[155] Lin, X., Zhao, C., Pan, W., 2017b. Towards accurate binary convolutional neural network, in: Advances in Neural Information Processing Systems (NIPS), pp. 345–353.
[156] Lin, Z., Courbariaux, M., Memisevic, R., Bengio, Y., 2016. Neural Networks with Few Multiplications, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1510.03009.
[157] Liu, J., Musialski, P., Wonka, P., Ye, J., 2013. Tensor Completion for Estimating Missing Values in Visual Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 208–220. URL: http://ieeexplore.ieee.org/document/6138863/, doi:10.1109/TPAMI.2012.39.
[158] Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C., 2017. Learning Efficient Convolutional Networks through Network Slimming, in: IEEE International Conference on Computer Vision (ICCV), IEEE. pp. 2755–2763. URL: http://ieeexplore.ieee.org/document/8237560/, doi:10.1109/ICCV.2017.298.
[159] Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, T.K.T., Sun, J., 2019a. MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning, in: IEEE International Conference on Computer Vision. URL: http://arxiv.org/abs/1903.10258.
[160] Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T., 2019b. Rethinking the Value of Network Pruning, in: International Conference on Learning Representations (ICLR), pp. 1–11.
[196] Preuser, T.B., Gambardella, G., Fraser, N., Blott, M., 2018. Inference of quantized neural networks on heterogeneous all-programmable devices, in: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE. pp. 833–838. URL: http://ieeexplore.ieee.org/document/8342121/, doi:10.23919/DATE.2018.8342121.
[197] Prost-Boucle, A., Bourge, A., Petrot, F., Alemdar, H., Caldwell, N., Leroy, V., 2017. Scalable high-performance architecture for convolutional ternary neural networks on FPGA, in: 2017 27th International Conference on Field Programmable Logic and Applications (FPL), IEEE. pp. 1–7. URL: http://ieeexplore.ieee.org/document/8056850/, doi:10.23919/FPL.2017.8056850.
[198] Qin, H., Gong, R., Liu, X., Bai, X., Song, J., Sebe, N., 2020a. Binary neural networks: A survey. Pattern Recognition 105, 107281. URL: https://linkinghub.elsevier.com/retrieve/pii/S0031320320300856, doi:10.1016/j.patcog.2020.107281.
[199] Qin, H., Gong, R., Liu, X., Shen, M., Wei, Z., Yu, F., Song, J., 2020b. Forward and Backward Information Retention for Accurate Binary Neural Networks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 2247–2256. URL: https://ieeexplore.ieee.org/document/9157443/, doi:10.1109/CVPR42600.2020.00232.
[200] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A., 2016. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, in: European Conference on Computer Vision, Springer. pp. 525–542. URL: http://arxiv.org/abs/1603.05279, doi:10.1007/978-3-319-46493-0_32.
[201] Reed, R., 1993. Pruning Algorithms - A Survey. IEEE Transactions on Neural Networks 4, 740–747. URL: http://ieeexplore.ieee.org/document/248452/, doi:10.1109/72.248452.
[202] Reuther, A., Michaleas, P., Jones, M., Gadepally, V., Samsi, S., Kepner, J., 2019. Survey and Benchmarking of Machine Learning Accelerators, in: 2019 IEEE High Performance Extreme Computing Conference (HPEC), IEEE. pp. 1–9. URL: https://ieeexplore.ieee.org/document/8916327/, doi:10.1109/HPEC.2019.8916327.
[203] Richard Chuang, Oliyide, O., Garrett, B., 2020. Introducing the Intel® Vision Accelerator Design with Intel® Arria® 10 FPGA. White Paper.
[204] Rodriguez, A., Segal, E., Meiri, E., Fomenko, E., Kim, Y.J., Shen, H., 2018. Lower Numerical Precision Deep Learning Inference and Training. Intel White Paper, 1–19. URL: https://software.intel.com/sites/default/files/managed/db/92/Lower-Numerical-Precision-Deep-Learning-Jan2018.pdf.
[205] Rotem, N., Fix, J., Abdulrasool, S., Catron, G., Deng, S., Dzhabarov, R., Gibson, N., Hegeman, J., Lele, M., Levenstein, R., Montgomery, J., Maher, B., Nadathur, S., Olesen, J., Park, J., Rakhov, A., Smelyanskiy, M., Wang, M., 2018. Glow: Graph lowering compiler techniques for neural networks. ArXiv preprint.
[206] Ruffy, F., Chahal, K., 2019. The State of Knowledge Distillation for Classification. ArXiv preprint. URL: http://arxiv.org/abs/1912.10850.
[207] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115, 211–252. URL: http://link.springer.com/10.1007/s11263-015-0816-y, doi:10.1007/s11263-015-0816-y.
[208] Saad, D., Marom, E., 1990. Training Feed Forward Nets with Binary Weights Via a Modified CHIR Algorithm. Complex Systems 4, 573–586. URL: https://www.complex-systems.com/pdf/04-5-5.pdf.
[209] Sabour, S., Frosst, N., Hinton, G.E., 2017. Dynamic routing between capsules, in: Advances in Neural Information Processing Systems (NIPS), pp. 3857–3867.
[210] Santurkar, S., Tsipras, D., Ilyas, A., Madry, A., 2018. How does batch normalization help optimization?, in: Advances in Neural Information Processing Systems (NIPS), pp. 2483–2493.
[211] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y., 2013. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1312.6229.
[212] Settle, S.O., Bollavaram, M., D'Alberto, P., Delaye, E., Fernandez, O., Fraser, N., Ng, A., Sirasao, A., Wu, M., 2018. Quantizing Convolutional Neural Networks for Low-Power High-Throughput Inference Engines. ArXiv preprint. URL: http://arxiv.org/abs/1805.07941.
[213] Shen, M., Han, K., Xu, C., Wang, Y., 2019. Searching for accurate binary neural architectures. Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019, 2041–2044. doi:10.1109/ICCVW.2019.00256.
[214] Shen, X., Yi, B., Zhang, Z., Shu, J., Liu, H., 2016. Automatic Recommendation Technology for Learning Resources with Convolutional Neural Network, in: Proceedings - 2016 International Symposium on Educational Technology, ISET 2016, pp. 30–34. doi:10.1109/ISET.2016.12.
[215] Sheng, T., Feng, C., Zhuo, S., Zhang, X., Shen, L., Aleksic, M., 2018. A Quantization-Friendly Separable Convolution for MobileNets. 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), 14–18. URL: https://ieeexplore.ieee.org/document/8524017/, doi:10.1109/EMC2.2018.00011.
[216] Simons, T., Lee, D.J., 2019. A review of binarized neural networks. Electronics (Switzerland) 8. doi:10.3390/electronics8060661.
[217] Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition, in: International Conference on Learning Representations (ICLR), pp. 1–14. URL: http://arxiv.org/abs/1409.1556.
[218] Singh, P., Kumar Verma, V., Rai, P., Namboodiri, V.P., 2019. Play and Prune: Adaptive Filter Pruning for Deep Model Compression, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, California. pp. 3460–3466. URL: https://www.ijcai.org/proceedings/2019/480, doi:10.24963/ijcai.2019/480.
[219] Society, I.C., Committee, M.S., 2008. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, 1–70. doi:10.1109/IEEESTD.2008.4610935.
[220] Soudry, D., Hubara, I., Meir, R., 2014. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights, in: Advances in Neural Information Processing Systems (NIPS), pp. 963–971. URL: https://dl.acm.org/doi/abs/10.5555/2968826.2968934.
[221] Srinivas, S., Babu, R.V., 2015. Data-free parameter pruning for Deep Neural Networks, in: Proceedings of the British Machine Vision Conference 2015, British Machine Vision Association. pp. 1–31. URL: http://www.bmva.org/bmvc/2015/papers/paper031/index.html, doi:10.5244/C.29.31.
[222] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958. URL: http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf, doi:10.5555/2627435.2670313.
[223] Sun, J., Luo, X., Gao, H., Wang, W., Gao, Y., Yang, X., 2020. Categorizing Malware via A Word2Vec-based Temporal Convolutional Network Scheme. Journal of Cloud Computing 9. doi:10.1186/s13677-020-00200-y.
[224] Sun, M., Song, Z., Jiang, X., Pan, J., Pang, Y., 2017. Learning Pooling for Convolutional Neural Network. Neurocomputing 224, 96–104. URL: http://dx.doi.org/10.1016/j.neucom.2016.10.049, doi:10.1016/j.neucom.2016.10.049.
[225] Sze, V., Chen, Y.H.H., Yang, T.J.J., Emer, J.S., 2017. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proceedings of the IEEE 105, 2295–2329. URL: http://ieeexplore.ieee.org/document/8114708/, doi:10.1109/JPROC.2017.2761740.
[226] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE. pp. 1–9. URL: http://ieeexplore.ieee.org/document/7298594/, doi:10.1109/CVPR.2015.7298594.
[227] TensorFlow, . Fixed Point Quantization. URL: https://www.tensorflow.org/lite/guide.
[228] Technologies, Q., 2019. Snapdragon Neural Processing Engine SDK. URL: https://developer.qualcomm.com/docs/snpe/index.html.
[229] Tencent, 2019. NCNN is a high-performance neural network inference framework optimized for the mobile platform. URL: https://github.com/Tencent/ncnn.
[230] Tibshirani, R., 1996. Regression shrinkage and selection via the Lasso. URL: https://statweb.stanford.edu/~tibs/lasso/lasso.pdf.
[231] Umuroglu, Y., Fraser, N.J., Gambardella, G., Blott, M., Leong, P., Jahre, M., Vissers, K., 2016. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17, 65–74. URL: http://dl.acm.org/citation.cfm?doid=3020078.3021744, doi:10.1145/3020078.3021744.
[232] Vanholder, H., 2016. Efficient Inference with TensorRT. Technical Report.
[233] Vanhoucke, V., Senior, A., Mao, M.Z., 2011. Improving the speed of neural networks on CPUs. URL: https://research.google/pubs/pub37631/.
[234] Venieris, S.I., Kouris, A., Bouganis, C.S., 2018. Toolflows for Mapping Convolutional Neural Networks on FPGAs. ACM Computing Surveys 51, 1–39. URL: http://dl.acm.org/citation.cfm?doid=3212709.3186332, doi:10.1145/3186332.
[235] Venkatesh, G., Nurvitadhi, E., Marr, D., 2017. Accelerating Deep Convolutional Networks using low-precision and sparsity, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 2861–2865. URL: http://ieeexplore.ieee.org/document/7952679/, doi:10.1109/ICASSP.2017.7952679.
[236] Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S., 2019a. HAQ: Hardware-Aware Automated Quantization With Mixed Precision, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 8604–8612. URL: http://arxiv.org/abs/1811.08886, doi:10.1109/CVPR.2019.00881.
[237] Wang, N., Choi, J., Brand, D., Chen, C.Y., Gopalakrishnan, K., 2018a. Training deep neural networks with 8-bit floating point numbers, in: Advances in Neural Information Processing Systems (NIPS), pp. 7675–7684.
[238] Wang, P., Cheng, J., 2017. Fixed-Point Factorized Networks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 3966–3974. URL: http://ieeexplore.ieee.org/document/8099905/, doi:10.1109/CVPR.2017.422.
[239] Wang, P., Hu, Q., Zhang, Y., Zhang, C., Liu, Y., Cheng, J., 2018b. Two-Step Quantization for Low-bit Neural Networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4376–4384. doi:10.1109/CVPR.2018.00460.
[240] Wang, Z., Lu, J., Tao, C., Zhou, J., Tian, Q., 2019b. Learning channel-wise interactions for binary convolutional neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 568–577. doi:10.1109/CVPR.2019.00066.
[241] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H., 2016. Learning Structured Sparsity in Deep Neural Networks, in: Advances in Neural Information Processing Systems (NIPS), IEEE. pp. 2074–2082. URL: https://dl.acm.org/doi/abs/10.5555/3157096.3157329.
[242] Wu, H., Judd, P., Zhang, X., Isaev, M., Micikevicius, P., 2020. Integer quantization for deep learning inference: Principles and empirical evaluation. ArXiv preprint, 1–20.
[243] Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J., 2016. Quantized Convolutional Neural Networks for Mobile Devices, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 4820–4828. URL: http://arxiv.org/abs/1512.06473, doi:10.1109/CVPR.2016.521.
[244] Wu, S., Li, G., Chen, F., Shi, L., 2018a. Training and Inference with Integers in Deep Neural Networks, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1802.04680.
[245] Wu, S., Li, G., Deng, L., Liu, L., Wu, D., Xie, Y., Shi, L., 2019. L1-Norm Batch Normalization for Efficient Training of Deep Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 30, 2043–2051. URL: https://ieeexplore.ieee.org/document/8528524/, doi:10.1109/TNNLS.2018.2876179.
[246] Wu, Z., Nagarajan, T., Kumar, A., Rennie, S., Davis, L.S., Grauman, K., Feris, R., 2018b. BlockDrop: Dynamic Inference Paths in Residual Networks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 8817–8826. URL: https://ieeexplore.ieee.org/document/8579017/, doi:10.1109/CVPR.2018.00919.
[247] Xiaomi, 2019. MACE is a deep learning inference framework optimized for mobile heterogeneous computing platforms. URL: https://github.com/XiaoMi/mace/.
[248] Xilinx, Inc, 2018. Accelerating DNNs with Xilinx Alveo Accelerator Cards (WP504). White Paper 504, 1–11. URL: www.xilinx.com.
[249] Xu, J., Huan, Y., Zheng, L.R., Zou, Z., 2019. A Low-Power Arithmetic Element for Multi-Base Logarithmic Computation on Deep Neural Networks, in: International System on Chip Conference, IEEE. pp. 260–265. URL: https://ieeexplore.ieee.org/document/8618560/, doi:10.1109/SOCC.2018.8618560.
[250] Xu, S., Huang, A., Chen, L., Zhang, B., 2020. Convolutional Neural Network Pruning: A Survey, in: 2020 39th Chinese Control Conference (CCC), IEEE. pp. 7458–7463. URL: https://ieeexplore.ieee.org/document/9189610/, doi:10.23919/CCC50068.2020.9189610.
[251] Xu, X., Lu, Q., Yang, L., Hu, S., Chen, D., Hu, Y., Shi, Y., 2018a. Quantization of Fully Convolutional Networks for Accurate Biomedical Image Segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8300–8308. doi:10.1109/CVPR.2018.00866.
[252] Xu, Z., Hsu, Y.C., Huang, J., 2018b. Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks, in: International Conference on Learning Representations (ICLR) - Workshop.
[253] Yang, J., Shen, X., Xing, J., Tian, X., Li, H., Deng, B., Huang, J., Hua, X.s., 2019. Quantization Networks, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 7300–7308. URL: https://ieeexplore.ieee.org/document/8953531/, doi:10.1109/CVPR.2019.00748.
[254] Yang, Y., Deng, L., Wu, S., Yan, T., Xie, Y., Li, G., 2020. Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks 125, 70–82. doi:10.1016/j.neunet.2019.12.027.
[255] Ye, J., Lu, X., Lin, Z., Wang, J.Z., 2018. Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. ArXiv preprint. URL: http://arxiv.org/abs/1802.00124.
[256] Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., Xin, J., 2019. Blended coarse gradient descent for full quantization of deep neural networks. Research in Mathematical Sciences 6. doi:10.1007/s40687-018-0177-6.
[257] Yogatama, D., Mann, G., 2014. Efficient Transfer Learning Method for Automatic Hyperparameter Tuning, in: Kaski, S., Corander, J. (Eds.), Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, PMLR, Reykjavik, Iceland. pp. 1077–1085. URL: http://proceedings.mlr.press/v33/yogatama14.html.
[258] Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S., 2017. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Computer Architecture News 45, 548–560. URL: http://dl.acm.org/citation.cfm?doid=3140659.3080215, doi:10.1145/3140659.3080215.
[259] Yu, J., Yang, L., Xu, N., Yang, J., Huang, T., 2018. Slimmable Neural Networks, in: International Conference on Learning Representations (ICLR), pp. 1–12. URL: http://arxiv.org/abs/1812.08928.
[260] Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67. URL: http://doi.wiley.com/10.1111/j.1467-9868.2005.00532.x, doi:10.1111/j.1467-9868.2005.00532.x.
[261] Yuan, Z., Hu, J., Wu, D., Ban, X., 2020. A dual-attention recurrent neural network method for deep cone thickener underflow concentration prediction. Sensors (Switzerland) 20, 1–18. doi:10.3390/s20051260.
[262] Zhang, D., Yang, J., Ye, D., Hua, G., 2018. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 373–390. doi:10.1007/978-3-030-01237-3_23.
[263] Zhang, Q., Zhang, M., Chen, T., Sun, Z., Ma, Y., Yu, B., 2019a. Recent Advances in Convolutional Neural Network Acceleration. Neurocomputing 323, 37–51. URL: https://linkinghub.elsevier.com/retrieve/pii/S0925231218311007, doi:10.1016/j.neucom.2018.09.038.
[264] Zhang, S., Du, Z., Zhang, L., Lan, H., Liu, S., Li, L., Guo, Q., Chen, T., Chen, Y., 2016a. Cambricon-X: An accelerator for sparse neural networks, in: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE. pp. 1–12. URL: http://ieeexplore.ieee.org/document/7783723/, doi:10.1109/MICRO.2016.7783723.
[265] Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R., Bengio, Y., 2016b. Architectural complexity measures of recurrent neural networks, in: Advances in Neural Information Processing Systems (NIPS), pp. 1830–1838.
[266] Zhang, Y., Zhao, C., Ni, B., Zhang, J., Deng, H., 2019b. Exploiting Channel Similarity for Accelerating Deep Convolutional Neural Networks. ArXiv preprint, 1–14. URL: http://arxiv.org/abs/1908.02620.
[267] Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.H., Srivastava, M., Gupta, R., Zhang, Z., 2017. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs, in: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '17, ACM Press, New York, New York, USA. pp. 15–24. URL: http://dl.acm.org/citation.cfm?doid=3020078.3021741, doi:10.1145/3020078.3021741.
[268] Zhong, K., Zhao, T., Ning, X., Zeng, S., Guo, K., Wang, Y., Yang, H., 2020. Towards Lower Bit Multiplication for Convolutional Neural Network Training. ArXiv preprint. URL: http://arxiv.org/abs/2006.02804.
[269] Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y., 2017a. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights, in: International Conference on Learning Representations (ICLR). URL: http://arxiv.org/abs/1702.03044.
[270] Zhou, H., Alvarez, J.M., Porikli, F., 2016a. Less Is More: Towards Compact CNNs, in: European Conference on Computer Vision, pp. 662–677. URL: https://link.springer.com/chapter/10.1007/978-3-319-46493-0_40, doi:10.1007/978-3-319-46493-0_40.
[271] Zhou, S., Kannan, R., Prasanna, V.K., 2018. Accelerating low rank matrix completion on FPGA, in: 2017 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2017, IEEE. pp. 1–7. URL: http://ieeexplore.ieee.org/document/8279771/, doi:10.1109/RECONFIG.2017.8279771.
[272] Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y., 2016b. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. ArXiv preprint, 1–13. URL: https://arxiv.org/abs/1606.06160.
[273] Zhou, S.C., Wang, Y.Z., Wen, H., He, Q.Y., Zou, Y.H., 2017b. Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks. Journal of Computer Science and Technology 32, 667–682. doi:10.1007/s11390-017-1750-y.
[274] Zhu, C., Han, S., Mao, H., Dally, W.J., 2017. Trained Ternary Quantization, in: International Conference on Learning Representations (ICLR), pp. 1–10. URL: http://arxiv.org/abs/1612.01064.
[275] Zhu, F., Gong, R., Yu, F., Liu, X., Wang, Y., Li, Z., Yang, X., Yan, J., . Towards Unified INT8 Training for Convolutional Neural Network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). URL: http://arxiv.org/abs/1912.12607.
[276] Zhuang, B., Shen, C., Tan, M., Liu, L., Reid, I., 2019. Structured binary neural networks for accurate image classification and semantic segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019-June, 413–422. doi:10.1109/CVPR.2019.00050.
[277] Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V., 2017. Learning Transferable Architectures for Scalable Image Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8697–8710. URL: https://ieeexplore.ieee.org/abstract/document/8579005/.