Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu,
Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko

Google Inc.
{benoitjacob,skligys,bochen,menglong,mttang,howarda,hadam,dkalenichenko}@google.com

arXiv:1712.05877v1 [cs.LG] 15 Dec 2017

Abstract

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.

1. Introduction

Current state-of-the-art Convolutional Neural Networks (CNNs) are not well suited for use on mobile devices. Since the advent of AlexNet [20], modern CNNs have primarily been appraised according to classification / detection accuracy. Thus network architectures have evolved without regard to model complexity and computational efficiency. On the other hand, successful deployment of CNNs on mobile platforms such as smartphones, AR/VR devices (HoloLens, Daydream), and drones requires small model sizes to accommodate limited on-device memory, and low latency to maintain user engagement. This has led to a burgeoning field of research that focuses on reducing the model size and inference time of CNNs with minimal accuracy losses.

Approaches in this field roughly fall into two categories. The first category, exemplified by MobileNet [10], SqueezeNet [16], ShuffleNet [32], and DenseNet [11], designs novel network architectures that exploit computation / memory efficient operations. The second category quantizes the weights and / or activations of a CNN from 32 bit floating point into lower bit-depth representations. This methodology, embraced by approaches such as Ternary weight networks (TWN [22]), Binary Neural Networks (BNN [14]), XNOR-net [27], and more [8, 21, 26, 33, 34, 35], is the focus of our investigation. Despite their abundance, current quantization approaches are lacking in two respects when it comes to trading off latency with accuracy.

First, prior approaches have not been evaluated on a reasonable baseline architecture. The most common baseline architectures, AlexNet [20], VGG [28] and GoogleNet [29], are all over-parameterized by design in order to extract marginal accuracy improvements. Therefore, it is easy to obtain sizable compression of these architectures, reducing quantization experiments on these architectures to proof-of-concepts at best. Instead, a more meaningful challenge would be to quantize model architectures that are already efficient at trading off latency with accuracy, e.g. MobileNets.

Second, many quantization approaches do not deliver verifiable efficiency improvements on real hardware. Approaches that quantize only the weights ([2, 4, 8, 33]) are primarily concerned with on-device storage and less with computational efficiency. Notable exceptions are binary, ternary and bit-shift networks [14, 22, 27]. These latter approaches employ weights that are either 0 or powers of 2, which allow multiplication to be implemented by bit shifts. However, while bit-shifts can be efficient in custom hardware, they provide little benefit on existing hardware with multiply-add instructions that, when properly used (i.e. pipelined), are not more expensive than additions alone. Moreover, multiplications are only expensive if the operands are wide, and the need to avoid multiplications diminishes with bit depth once both weights and activations are quantized. Notably, these approaches rarely provide on-device measurements to verify the promised timing improvements. More runtime-friendly approaches quantize both the weights and the activations into 1 bit representations [14, 27, 34]. With these approaches, both multiplications and additions can be implemented by efficient bit-shift and bit-count operations, which are showcased in custom GPU kernels (BNN [14]). However, 1 bit quantization often leads to substantial performance degradation, and may be overly stringent on model representation.

[Figure 1.1 panels: (a) Integer-arithmetic-only inference; (b) Training with simulated quantization; (c) ImageNet latency-vs-accuracy tradeoff (Top 1 accuracy vs. latency in ms; series: Float, 8-bit).]

Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output are represented as 8-bit integers according to equation 1. The convolution involves 8-bit integer operands and a 32-bit integer accumulator. The bias addition involves only 32-bit integers (section 2.4). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic. Weight quantization ("wt quant") and activation quantization ("act quant") nodes are injected into the computation graph to simulate the effects of quantization of the variables (section 3). The resultant graph approximates the integer-arithmetic-only computation graph in panel a), while being trainable using conventional optimization algorithms for floating point models. c) Our quantization scheme benefits from the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section 4). The figure compares integer quantized MobileNets [10] against floating point baselines on ImageNet [3] using Qualcomm Snapdragon 835 LITTLE cores.

In this paper we address the above issues by improving the latency-vs-accuracy tradeoffs of MobileNets on common mobile hardware. Our specific contributions are:

• We provide a quantization scheme (section 2.1) that quantizes both weights and activations as 8-bit integers, and just a few parameters (bias vectors) as 32-bit integers.

• We provide a quantized inference framework that is efficiently implementable on integer-arithmetic-only hardware such as the Qualcomm Hexagon (sections 2.2, 2.3), and we describe an efficient, accurate implementation on ARM NEON (Appendix B).

• We provide a quantized training framework (section 3) co-designed with our quantized inference to minimize the loss of accuracy from quantization on real models.

• We apply our frameworks to efficient classification and detection systems based on MobileNets and provide benchmark results on popular ARM CPUs (section 4) that show significant improvements in the latency-vs-accuracy tradeoffs for state-of-the-art MobileNet architectures, demonstrated in ImageNet classification [3], COCO object detection [23], and other tasks.

Our work draws inspiration from [7], which leverages low-precision fixed-point arithmetic to accelerate the training speed of CNNs, and from [31], which uses 8-bit fixed-point arithmetic to speed up inference on x86 CPUs. Our quantization scheme focuses instead on improving the inference speed vs accuracy tradeoff on mobile CPUs.

2. Quantized Inference

2.1. Quantization scheme

In this section, we describe our general quantization scheme¹ ², that is, the correspondence between the bit-representation of values (denoted q below, for "quantized value") and their interpretation as mathematical real numbers (denoted r below, for "real value"). Our quantization scheme is implemented using integer-only arithmetic during inference and floating-point arithmetic during training, with both implementations maintaining a high degree of correspondence with each other. We achieve this by first providing a mathematically rigorous definition of our quantization scheme, and separately adopting this scheme for both integer-arithmetic inference and floating-point training.

¹ The quantization scheme described here is the one adopted in TensorFlow Lite [5] and we will refer to specific parts of its code to illustrate aspects discussed below.
² We had earlier described this quantization scheme in the documentation of gemmlowp [18]. That page may still be useful as an alternate treatment of some of the topics developed in this section, and for its self-contained example code.
A basic requirement of our quantization scheme is that it permits efficient implementation of all arithmetic using only integer arithmetic operations on the quantized values (we eschew implementations requiring lookup tables because these tend to perform poorly compared to pure arithmetic on SIMD hardware). This is equivalent to requiring that the quantization scheme be an affine mapping of integers q to real numbers r, i.e. of the form

$$ r = S(q - Z) \tag{1} $$

for some constants S and Z. Equation (1) is our quantization scheme and the constants S and Z are our quantization parameters. Our quantization scheme uses a single set of quantization parameters for all values within each activations array and within each weights array; separate arrays use separate quantization parameters.

For 8-bit quantization, q is quantized as an 8-bit integer (for B-bit quantization, q is quantized as a B-bit integer). Some arrays, typically bias vectors, are quantized as 32-bit integers, see section 2.4.

The constant S (for "scale") is an arbitrary positive real number. It is typically represented in software as a floating-point quantity, like the real values r. Section 2.2 describes methods for avoiding the representation of such floating-point quantities in the inference workload.

The constant Z (for "zero-point") is of the same type as quantized values q, and is in fact the quantized value q corresponding to the real value 0. This allows us to automatically meet the requirement that the real value r = 0 be exactly representable by a quantized value. The motivation for this requirement is that efficient implementation of neural network operators often requires zero-padding of arrays around boundaries.

Our discussion so far is summarized in the following quantized buffer data structure³, with one instance of such a buffer existing for each activations array and weights array in a neural network. We use C++ syntax because it allows the unambiguous conveyance of types.

    #include <vector>

    template<typename QType>  // e.g. QType=uint8
    struct QuantizedBuffer {
      std::vector<QType> q;   // the quantized values
      float S;                // the scale
      QType Z;                // the zero-point
    };

³ The actual data structures in the TensorFlow Lite [5] Converter are QuantizationParams and Array in this header file. As we discuss in the next subsection, this data structure, which still contains a floating-point quantity, does not appear in the actual quantized on-device inference code.

2.2. Integer-arithmetic-only matrix multiplication

We now turn to the question of how to perform inference using only integer arithmetic, i.e. how to use Equation (1) to translate real-numbers computation into quantized-values computation, and how the latter can be designed to involve only integer arithmetic even though the scale values S are not integers.

Consider the multiplication of two square N × N matrices of real numbers, $r_1$ and $r_2$, with their product represented by $r_3 = r_1 r_2$. We denote the entries of each of these matrices $r_\alpha$ ($\alpha$ = 1, 2 or 3) as $r_\alpha^{(i,j)}$ for $1 \le i, j \le N$, and the quantization parameters with which they are quantized as $(S_\alpha, Z_\alpha)$. We denote the quantized entries by $q_\alpha^{(i,j)}$. Equation (1) then becomes:

$$ r_\alpha^{(i,j)} = S_\alpha \left( q_\alpha^{(i,j)} - Z_\alpha \right). \tag{2} $$

From the definition of matrix multiplication, we have

$$ S_3 \left( q_3^{(i,k)} - Z_3 \right) = \sum_{j=1}^{N} S_1 \left( q_1^{(i,j)} - Z_1 \right) S_2 \left( q_2^{(j,k)} - Z_2 \right), \tag{3} $$

which can be rewritten as

$$ q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} \left( q_1^{(i,j)} - Z_1 \right) \left( q_2^{(j,k)} - Z_2 \right), \tag{4} $$

where the multiplier M is defined as

$$ M := \frac{S_1 S_2}{S_3}. \tag{5} $$

In Equation (4), the only non-integer is the multiplier M. As a constant depending only on the quantization scales $S_1, S_2, S_3$, it can be computed offline. We empirically find it to always be in the interval (0, 1), and can therefore express it in the normalized form

$$ M = 2^{-n} M_0 \tag{6} $$

where $M_0$ is in the interval [0.5, 1) and n is a non-negative integer. The normalized multiplier $M_0$ now lends itself well to being expressed as a fixed-point multiplier (e.g. int16 or int32 depending on hardware capability). For example, if int32 is used, the integer representing $M_0$ is the int32 value nearest to $2^{31} M_0$. Since $M_0 \ge 0.5$, this value is always at least $2^{30}$ and will therefore always have at least 30 bits of relative accuracy. Multiplication by $M_0$ can thus be implemented as a fixed-point multiplication⁴. Meanwhile, multiplication by $2^{-n}$ can be implemented with an efficient bit-shift, albeit one that needs to have correct round-to-nearest behavior, an issue that we return to in Appendix B.

⁴ The computation discussed in this section is implemented in TensorFlow Lite [5] reference code for a fully-connected layer.

2.3. Efficient handling of zero-points

In order to efficiently implement the evaluation of Equation (4) without having to perform $2N^3$ subtractions and


without having to expand the operands of the multiplication into 16-bit integers, we first notice that by distributing the multiplication in Equation (4), we can rewrite it as

$$ q_3^{(i,k)} = Z_3 + M \left( N Z_1 Z_2 - Z_1 a_2^{(k)} - Z_2 \bar{a}_1^{(i)} + \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \right) \tag{7} $$

where

$$ a_2^{(k)} := \sum_{j=1}^{N} q_2^{(j,k)}, \qquad \bar{a}_1^{(i)} := \sum_{j=1}^{N} q_1^{(i,j)}. \tag{8} $$

Each $a_2^{(k)}$ or $\bar{a}_1^{(i)}$ takes only N additions to compute, so they collectively take only $2N^2$ additions. The rest of the cost of the evaluation of (7) is almost entirely concentrated in the core integer matrix multiplication accumulation

$$ \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \tag{9} $$

which takes $2N^3$ arithmetic operations; indeed, everything else involved in (7) is $O(N^2)$ with a small constant in the O. Thus, the expansion into the form (7) and the factored-out computation of $a_2^{(k)}$ and $\bar{a}_1^{(i)}$ enable low-overhead handling of arbitrary zero-points for anything but the smallest values of N, reducing the problem to the same core integer matrix multiplication accumulation (9) as we would have to compute in any other zero-points-free quantization scheme.

2.4. Implementation of a typical fused layer

We continue the discussion of section 2.3, but now explicitly define the data types of all quantities involved, and modify the quantized matrix multiplication (7) to merge the bias-addition and activation function evaluation directly into it. This fusing of whole layers into a single operation is not only an optimization. As we must reproduce in inference code the same arithmetic that is used in training, the granularity of fused operators in inference code (taking an 8-bit quantized input and producing an 8-bit quantized output) must match the placement of "fake quantization" operators in the training graph (section 3).

For our implementation on ARM and x86 CPU architectures, we use the gemmlowp library [18], whose GemmWithOutputPipeline entry point supports the fused operations that we now describe⁵.

⁵ The discussion in this section is implemented in TensorFlow Lite [5] for e.g. a Convolutional operator (reference code is self-contained, optimized code calls into gemmlowp [18]).

We take the $q_1$ matrix to be the weights, and the $q_2$ matrix to be the activations. Both the weights and activations are of type uint8 (we could have equivalently chosen int8, with suitably modified zero-points). Accumulating products of uint8 values requires a 32-bit accumulator, and we choose a signed type for the accumulator for a reason that will soon become clear. The sum in (9) is thus of the form:

    int32 += uint8 * uint8.    (10)

In order to have the quantized bias-addition be the addition of an int32 bias into this int32 accumulator, the bias-vector is quantized such that: it uses int32 as its quantized data type; it uses 0 as its quantization zero-point $Z_{bias}$; and its quantization scale $S_{bias}$ is the same as that of the accumulators, which is the product of the scales of the weights and of the input activations. In the notation of section 2.3,

$$ S_{bias} = S_1 S_2, \qquad Z_{bias} = 0. \tag{11} $$

Although the bias-vectors are quantized as 32-bit values, they account for only a tiny fraction of the parameters in a neural network. Furthermore, the use of higher precision for bias vectors meets a real need: as each bias-vector entry is added to many output activations, any quantization error in the bias-vector tends to act as an overall bias (i.e. an error term with nonzero mean), which must be avoided in order to preserve good end-to-end neural network accuracy⁶.

⁶ The quantization of bias-vectors discussed here is implemented here in the TensorFlow Lite [5] Converter.

With the final value of the int32 accumulator, there remain three things left to do: scale down to the final scale used by the 8-bit output activations, cast down to uint8 and apply the activation function to yield the final 8-bit output activation.

The down-scaling corresponds to multiplication by the multiplier M in equation (7). As explained in section 2.2, it is implemented as a fixed-point multiplication by a normalized multiplier $M_0$ and a rounding bit-shift. Afterwards, we perform a saturating cast to uint8, saturating to the range [0, 255].

We focus on activation functions that are mere clamps, e.g. ReLU, ReLU6. Mathematical functions are discussed in appendix A.1 and we do not currently fuse them into such layers. Thus, the only thing that our fused activation functions need to do is to further clamp the uint8 value to some sub-interval of [0, 255] before storing the final uint8 output activation. In practice, the quantized training process (section 3) tends to learn to make use of the whole output uint8 [0, 255] interval so that the activation function no longer does anything, its effect being subsumed in the clamping to [0, 255] implied in the saturating cast to uint8.
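To make the arithmetic of sections 2.2 to 2.4 concrete, the following is a minimal pure-Python sketch of a fused quantized layer: the multiplier normalization of equation (6), the zero-point expansion of equation (7), the int32 bias of equation (11), and the final saturating cast. It is an illustration under stated assumptions, not the gemmlowp or TensorFlow Lite implementation: the function names are hypothetical, Python integers stand in for int32/int64 registers, and the rounding shift uses a simple round-half-up rule rather than the exact rounding behavior discussed in Appendix B.

```python
import math

def quantize_multiplier(m):
    """Split a real multiplier 0 < m < 1 into (m0_q31, n) so that
    m ~= m0_q31 * 2**-31 * 2**-n, with m0 in [0.5, 1) as in equation (6)."""
    if not 0.0 < m < 1.0:
        raise ValueError("expected 0 < m < 1")
    m0, exp = math.frexp(m)            # m = m0 * 2**exp with 0.5 <= m0 < 1
    n = -exp                           # hence m = 2**-n * m0 and n >= 0
    # Integer nearest to 2**31 * m0, kept inside the int32 range (assumption).
    m0_q31 = min(round(m0 * (1 << 31)), (1 << 31) - 1)
    return m0_q31, n

def rounding_right_shift(x, n):
    """Divide by 2**n, rounding to nearest (ties toward +infinity)."""
    if n == 0:
        return x
    return (x + (1 << (n - 1))) >> n

def multiply_by_quantized_multiplier(acc, m0_q31, n):
    """Fixed-point multiply by m0, then a rounding shift for the factor 2**-n."""
    return rounding_right_shift(rounding_right_shift(acc * m0_q31, 31), n)

def fused_quantized_layer(q1, z1, q2, z2, bias_q, m0_q31, n, z3):
    """q1: uint8 weights (rows x depth); q2: uint8 activations (depth x cols);
    bias_q: int32 bias per output row with scale S1*S2 and zero-point 0."""
    rows, depth, cols = len(q1), len(q2), len(q2[0])
    row_sums = [sum(row) for row in q1]                       # a-bar_1^(i), eq. (8)
    col_sums = [sum(q2[j][k] for j in range(depth)) for k in range(cols)]  # a_2^(k)
    out = []
    for i in range(rows):
        out_row = []
        for k in range(cols):
            acc = sum(q1[i][j] * q2[j][k] for j in range(depth))   # eq. (9)
            acc += depth * z1 * z2 - z1 * col_sums[k] - z2 * row_sums[i]  # eq. (7)
            acc += bias_q[i]                                        # int32 bias add
            q3 = z3 + multiply_by_quantized_multiplier(acc, m0_q31, n)
            out_row.append(max(0, min(255, q3)))                    # saturate to uint8
        out.append(out_row)
    return out

# Illustrative values (made up for this sketch): S1=0.02, S2=0.05, S3=0.1.
m0_q31, n = quantize_multiplier(0.02 * 0.05 / 0.1)   # M = 0.01
bias_q = [round(0.5 / (0.02 * 0.05)), round(-0.2 / (0.02 * 0.05))]
result = fused_quantized_layer([[130, 120], [128, 200]], 128,
                               [[15, 25], [35, 45]], 5,
                               bias_q, m0_q31, n, 10)
print(result)  # [[13, 12], [30, 37]]
```

With these made-up parameters, dequantizing the result with equation (1) (S3 = 0.1, Z3 = 10) gives 0.3, 0.2, 2.0 and 2.7, which are the real-valued outputs 0.28, 0.22, 1.96 and 2.68 rounded to the output grid. Only the offline step `quantize_multiplier` touches floating point; the per-element work is integer multiply, add and shift.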
3. Training with simulated quantization

A common approach to training quantized networks is to train in floating point and then quantize the resulting weights (sometimes with additional post-quantization training for fine-tuning). We found that this approach works sufficiently well for large models with considerable representational capacity, but leads to significant accuracy drops for small models. Common failure modes for simple post-training quantization include: 1) large differences (more than 100×) in ranges of weights for different output channels (section 2 mandates that all channels of the same layer be quantized to the same resolution, which causes weights in channels with smaller ranges to have much higher relative error) and 2) outlier weight values that make all remaining weights less precise after quantization.

We propose an approach that simulates quantization effects in the forward pass of training. Backpropagation still happens as usual, and all weights and biases are stored in floating point so that they can be easily nudged by small amounts. The forward propagation pass however simulates quantized inference as it will happen in the inference engine, by implementing in floating-point arithmetic the rounding behavior of the quantization scheme that we introduced in section 2:

• Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer, the batch normalization parameters are "folded into" the weights before quantization, see section 3.2.

• Activations are quantized at points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer's output, or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.

For each layer, quantization is parameterized by the number of quantization levels and clamping range, and is performed by applying point-wise the quantization function q defined as follows:

$$ \begin{aligned}
\mathrm{clamp}(r; a, b) &:= \min(\max(r, a), b) \\
s(a, b, n) &:= \frac{b - a}{n - 1} \\
q(r; a, b, n) &:= \left\lfloor \frac{\mathrm{clamp}(r; a, b) - a}{s(a, b, n)} \right\rceil s(a, b, n) + a,
\end{aligned} \tag{12} $$

where r is a real-valued number to be quantized, [a; b] is the quantization range, n is the number of quantization levels, and ⌊·⌉ denotes rounding to the nearest integer. n is fixed for all layers in our experiments, e.g. $n = 2^8 = 256$ for 8 bit quantization.

3.1. Learning quantization ranges

Quantization ranges are treated differently for weight quantization vs. activation quantization:

• For weights, the basic idea is simply to set a := min w, b := max w. We apply a minor tweak to this so that the weights, once quantized as int8 values, only range in [−127, 127] and never take the value −128, as this enables a substantial optimization opportunity (for more details, see Appendix B).

• For activations, ranges depend on the inputs to the network. To estimate the ranges, we collect [a; b] ranges seen on activations during training and then aggregate them via exponential moving averages (EMA) with the smoothing parameter being close to 1 so that observed ranges are smoothed across thousands of training steps. Given significant delay in the EMA updating activation ranges when the ranges shift rapidly, we found it useful to completely disable activation quantization at the start of training (say, for 50 thousand to 2 million steps). This allows the network to enter a more stable state where activation quantization ranges do not exclude a significant fraction of values.

In both cases, the boundaries [a; b] are nudged so that value 0.0 is exactly representable as an integer z(a, b, n) after quantization. As a result, the learned quantization parameters map to the scale S and zero-point Z in equation 1:

$$ S = s(a, b, n), \qquad Z = z(a, b, n) \tag{13} $$

Below we depict simulated quantization assuming that the computations of a neural network are captured as a TensorFlow graph [1]. A typical workflow is described in Algorithm 1. Optimization of the inference graph by fusing and removing operations is outside the scope of this paper. Source code for graph modifications (inserting fake quantization operations, creating and optimizing the inference graph) and a low bit inference engine has been open-sourced with TensorFlow contributions in [19].

Algorithm 1 Quantized graph training and inference
1: Create a training graph of the floating-point model.
2: Insert fake quantization TensorFlow operations in locations where tensors will be downcast to fewer bits during inference according to equation 12.
3: Train in simulated quantized mode until convergence.
4: Create and optimize the inference graph for running in a low bit inference engine.
5: Run inference using the quantized inference graph.

Figure 1.1a and b illustrate TensorFlow graphs before and after quantization for a simple convolutional layer. Illustrations of the more complex convolution with a bypass connection in figure C.3 can be found in figure C.4.
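The quantization function of equation (12) and the range handling of section 3.1 can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names are hypothetical, ties are rounded half-up (the text only says "rounding to the nearest integer"), and the particular nudging rule in `nudge_range` is one plausible way to make 0.0 exactly representable, since the text does not spell out z(a, b, n).

```python
import math

def clamp(r, a, b):
    """clamp(r; a, b) from equation (12)."""
    return min(max(r, a), b)

def scale(a, b, n):
    """s(a, b, n) from equation (12): width of one quantization step."""
    return (b - a) / (n - 1)

def round_nearest(x):
    """Round to nearest integer, ties toward +infinity (assumed tie rule)."""
    return math.floor(x + 0.5)

def fake_quantize(r, a, b, n=256):
    """q(r; a, b, n) from equation (12): snap r to one of n levels in [a, b],
    returning a float so the training graph stays in floating point."""
    s = scale(a, b, n)
    return round_nearest((clamp(r, a, b) - a) / s) * s + a

def nudge_range(a, b, n=256):
    """Shift [a, b] slightly so that 0.0 falls exactly on a quantization level.
    Assumes a <= 0 <= b. Returns (a_nudged, b_nudged, z), z being the
    integer level that represents 0.0 (an assumed form of z(a, b, n))."""
    s = scale(a, b, n)
    z = int(clamp(round_nearest(-a / s), 0, n - 1))
    a_nudged = -z * s
    return a_nudged, a_nudged + (n - 1) * s, z

def ema_update(previous, observed, decay=0.999):
    """Exponential moving average used to smooth observed activation ranges;
    the decay value is a placeholder, the text only says 'close to 1'."""
    return decay * previous + (1.0 - decay) * observed

# Usage: a ReLU6-style range [0, 6] with 256 levels.
print(fake_quantize(7.0, 0.0, 6.0))    # clamped to the top of the range
print(fake_quantize(0.02, 0.0, 6.0))   # snapped to the first non-zero level
a, b, z = nudge_range(-1.0, 1.0)
print(fake_quantize(0.0, a, b))        # 0.0 stays exactly representable
```

During training, values pass through `fake_quantize` in the forward pass only, so the rounding error that integer inference will introduce is already visible to the loss while weights remain floating point.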
Note that the biases are not quantized because they are represented as 32-bit integers in the inference process, with a much higher range and precision compared to the 8 bit weights and activations. Furthermore, quantization parameters used for biases are inferred from the quantization parameters of the weights and activations. See section 2.4.

Typical TensorFlow code illustrating use of [19] follows:

    import tensorflow as tf  # TensorFlow 1.x, which ships tf.contrib
    from tensorflow.contrib.quantize \
        import quantize_graph as qg

    g = tf.Graph()
    with g.as_default():
      output = ...
      total_loss = ...
      optimizer = ...
      train_tensor = ...
    # is_training selects the training graph or the evaluation graph.
    if is_training:
      quantized_graph = \
          qg.create_training_graph(g)
    else:
      quantized_graph = \
          qg.create_eval_graph(g)
    # Train or evaluate quantized_graph.

3.2. Batch normalization folding

For models that use batch normalization (see [17]), there is additional complexity: the training graph contains batch normalization as a separate block of operations, whereas the inference graph has batch normalization parameters "folded" into the convolutional or fully connected layer's weights and biases, for efficiency. To accurately simulate quantization effects, we need to simulate this folding, and quantize weights after they have been scaled by the batch normalization parameters. We do so with the following:

$$ w_{fold} := \frac{\gamma w}{\sqrt{EMA(\sigma_B^2) + \varepsilon}}. \tag{14} $$

Here $\gamma$ is the batch normalization's scale parameter, $EMA(\sigma_B^2)$ is the moving average estimate of the variance of convolution results across the batch, and $\varepsilon$ is just a small constant for numerical stability.

After folding, the batch-normalized convolutional layer reduces to the simple convolutional layer depicted in figure 1.1a with the folded weights $w_{fold}$ and the corresponding folded biases. Therefore the same recipe in figure 1.1b applies. See the appendix for the training graph (figure C.5) for a batch-normalized convolutional layer, the corresponding inference graph (figure C.6), the training graph after batch-norm folding (figure C.7) and the training graph after both folding and quantization (figure C.8).

    ResNet depth                 50      100     150
    Floating-point accuracy      76.4%   78.0%   78.8%
    Integer-quantized accuracy   74.9%   76.6%   76.7%

Table 4.1: ResNet on ImageNet: Floating-point vs quantized network accuracy for various network depths.

    Scheme            BWN       TWN       INQ       FGQ     Ours
    Weight bits       1         2         5         2       8
    Activation bits   float32   float32   float32   8       8
    Accuracy          68.7%     72.5%     74.8%     70.8%   74.9%

Table 4.2: ResNet on ImageNet: Accuracy under various quantization schemes, including binary weight networks (BWN [21, 15]), ternary weight networks (TWN [21, 22]), incremental network quantization (INQ [33]) and fine-grained quantization (FGQ [26]).

4. Experiments

We conducted two sets of experiments, one showcasing the effectiveness of quantized training (section 4.1), and the other illustrating the improved latency-vs-accuracy tradeoff of quantized models on common hardware (section 4.2). The most performance-critical part of the inference workload on the neural networks being benchmarked is matrix multiplication (GEMM). The GEMM inference code uses the gemmlowp library [18] for 8-bit quantized inference, and the Eigen library [6] for 32-bit floating-point inference.

4.1. Quantized training of Large Networks

We apply quantized training to ResNets [9] and InceptionV3 [30] on the ImageNet dataset. These popular networks are too computationally intensive to be deployed on mobile devices, but are included for comparison purposes. Training protocols are discussed in Appendix D.1 and D.2.

4.1.1 ResNets

We compare floating-point vs integer-quantized ResNets for various depths in table 4.1. Accuracies of integer-only quantized networks are within roughly 2% of their floating-point counterparts (the largest gap in table 4.1 is 2.1 points, at depth 150).

We also list ResNet50 accuracies under different quantization schemes in table 4.2. As expected, integer-only quantization outperforms FGQ [26], which uses 2 bits for weight quantization. INQ [33] (5-bit weight, floating-point activation) achieves an accuracy similar to ours, but we provide additional run-time improvements (see section 4.2).
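As a small illustration of the batch normalization folding rule of equation (14) in section 3.2, the sketch below folds the parameters of a single output channel and checks that the folded layer reproduces convolution followed by batch normalization. The function name and argument names are hypothetical. The folded-bias expression is the standard batch-normalization identity and is an assumption here, since the text mentions "folded biases" without giving their formula; a dot product stands in for the convolution.

```python
import math

def fold_batch_norm(w, gamma, beta, ema_mean, ema_var, eps=1e-3):
    """Fold batch-norm parameters of one output channel into its weights.

    w        : list of weights feeding this output channel
    gamma    : batch-norm scale; beta: batch-norm offset
    ema_mean : moving-average mean of the convolution output (assumed input)
    ema_var  : moving-average variance EMA(sigma_B^2) from equation (14)
    Returns (w_fold, b_fold).
    """
    factor = gamma / math.sqrt(ema_var + eps)   # gamma / sqrt(EMA(var) + eps)
    w_fold = [factor * wi for wi in w]          # equation (14)
    b_fold = beta - factor * ema_mean           # standard identity (assumption)
    return w_fold, b_fold

# Usage with made-up numbers: the folded layer matches conv followed by BN.
w, x = [1.0, -0.5], [0.7, 0.2]
gamma, beta, mean, var, eps = 2.0, 0.1, 0.3, 3.0, 1.0

conv = sum(wi * xi for wi, xi in zip(w, x))
bn_out = gamma * (conv - mean) / math.sqrt(var + eps) + beta

w_fold, b_fold = fold_batch_norm(w, gamma, beta, mean, var, eps)
folded_out = sum(wi * xi for wi, xi in zip(w_fold, x)) + b_fold

print(abs(bn_out - folded_out) < 1e-12)
```

In simulated-quantization training it is `w_fold`, not `w`, that would be passed through the weight quantization function of equation (12), so that training sees the same weights the inference graph will quantize.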
4.1.2 Inception v3 on ImageNet

We compare the Inception v3 model quantized into 8 and 7 bits, respectively. 7-bit quantization is obtained by setting the number of quantization levels in equation 12 to $n = 2^7$. We additionally probe the sensitivity of activation quantization by comparing networks with two activation nonlinearities, ReLU6 and ReLU. The training protocol is in Appendix D.2.

Table 4.3 shows that 7-bit quantized training produces model accuracies close to that of 8-bit quantized training, and quantized models with ReLU6 have less accuracy degradation. The latter can be explained by noticing that ReLU6 introduces the interval [0, 6] as a natural range for activations, while ReLU allows activations to take values from a possibly larger interval, with different ranges in different channels. Values in a fixed range are easier to quantize with high precision.

    Act.    type     accuracy            recall 5
                     mean    std. dev.   mean    std. dev.
    ReLU6   floats   78.4%   0.1%        94.1%   0.1%
            8 bits   75.4%   0.1%        92.5%   0.1%
            7 bits   75.0%   0.3%        92.4%   0.2%
    ReLU    floats   78.3%   0.1%        94.2%   0.1%
            8 bits   74.2%   0.2%        92.2%   0.1%
            7 bits   73.7%   0.3%        92.0%   0.1%

Table 4.3: Inception v3 on ImageNet: Accuracy and recall 5 comparison of floating point and quantized models.

4.2. Quantization of MobileNets

MobileNets are a family of architectures that achieve a state-of-the-art tradeoff between on-device latency and ImageNet classification accuracy. In this section we demonstrate how integer-only quantization can further improve the tradeoff on common hardware.

4.2.1 ImageNet

We benchmarked the MobileNet architecture with varying depth-multipliers (DM) and resolutions on ImageNet on three types of Qualcomm cores, which represent three different micro-architectures: 1) Snapdragon 835 LITTLE core (figure 1.1c), a power-efficient processor found in Google Pixel 2; 2) Snapdragon 835 big core (figure 4.1), a high-performance core employed by Google Pixel 2; and 3) Snapdragon 821 big core (figure 4.2), a high-performance core used in Google Pixel 1.

Integer-only quantized MobileNets achieve higher accuracies than floating-point MobileNets given the same runtime budget. The accuracy gap is quite substantial (∼ 10%) for Snapdragon 835 LITTLE cores at the 33ms latency needed for real-time (30 fps) operation. While most of the quantization literature focuses on minimizing accuracy loss for a given architecture, we advocate for a more comprehensive latency-vs-accuracy tradeoff as a better measure. Note that this tradeoff depends critically on the relative speed of floating-point vs integer-only arithmetic in hardware. Floating-point computation is better optimized in the Snapdragon 821, for example, resulting in a less noticeable reduction in latency for quantized models.

[Plot: Top 1 accuracy vs. latency (ms); series: Float, 8-bit.]

Figure 4.1: ImageNet classifier on Qualcomm Snapdragon 835 big cores: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets.

[Plot: Top 1 accuracy vs. latency (ms); series: Float, 8-bit.]

Figure 4.2: ImageNet classifier on Qualcomm Snapdragon 821: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets.

4.2.2 COCO

We evaluated quantization in the context of mobile real time object detection, comparing the performance of quantized 8-bit and float models of MobileNet SSD [10, 25] on the COCO dataset [24]. We replaced all the regular convolutions in the SSD prediction layers with separable convolutions (depthwise followed by 1 × 1 projection). This modification is consistent with the overall design of MobileNets and makes them more computationally efficient. We utilized the Open Source TensorFlow Object Detection API [12] to train and evaluate our models. The training protocol is described in Appendix D.3. We also delayed quantization for 500 thousand steps (see section 3.1), finding that it significantly decreases the time to convergence.

Table 4.4 shows the latency-vs-accuracy tradeoff between floating-point and integer-quantized models. Latency was measured on a single thread using Snapdragon 835 cores (big and LITTLE). Quantized training and inference result in up to a 50% reduction in running time, with a minimal loss in accuracy (−1.8% relative).

    DM     Type     mAP    LITTLE (ms)   big (ms)
    100%   floats   22.1   778           370
           8 bits   21.7   687           272
    50%    floats   16.7   270           121
           8 bits   16.6   146           61

Table 4.4: Object detection speed and accuracy on COCO dataset of floating point and integer-only quantized models. Latency (ms) is measured on Qualcomm Snapdragon 835 big and LITTLE cores.

4.2.3 Face detection

To better examine quantized MobileNet SSD on a smaller scale, we benchmarked face detection on the face attribute classification dataset (a Flickr-based dataset used in [10]). We contacted the authors of [10] to evaluate our quantized MobileNets on detection and face attributes following the same protocols (detailed in Appendix D.4).

As indicated by tables 4.5 and 4.6, quantization provides close to a 2× latency reduction with a Qualcomm Snapdragon 835 big or LITTLE core at the cost of a ∼ 2% drop in the average precision. Notably, quantization allows the 25% face detector to run in real-time (1K/28 ≈ 36 fps) on a single big core, whereas the floating-point model remains slower than real-time (1K/44 ≈ 23 fps).

    DM     type     Precision   Recall
    100%   floats   68%         76%
           8 bits   66%         75%
    50%    floats   65%         70%
           8 bits   62%         70%
    25%    floats   56%         64%
           8 bits   54%         63%

Table 4.5: Face detection accuracy of floating point and integer-only quantized models. The reported precision / recall is averaged over different precision / recall values where an IOU of x between the groundtruth and predicted windows is considered a correct detection, for x in {0.5, 0.55, . . . , 0.95}.

    DM     type     LITTLE cores (ms)    big cores (ms)
                    1     2     4        1     2     4
    100%   floats   711   –     –        337   –     –
           8 bits   372   238   167      154   100   69
    50%    floats   233   –     –        106   –     –
           8 bits   134   96    74       56    40    30
    25%    floats   100   –     –        44    –     –
           8 bits   67    52    43       28    22    18

Table 4.6: Face detection: latency of floating point and quantized models on Qualcomm Snapdragon 835 cores, by number of cores used (1, 2 or 4).

Since quantized training results in little accuracy degradation, we see an improved tradeoff even though the Qualcomm Snapdragon 821 is highly optimized for floating point arithmetic (see Figure 4.2 for comparison).
We additionally examine the effect of multi-threading on 8-bit
the latency of quantized models. Table 4.6 shows a 1.5 to 0.82
1 2 4 8 16
2.2×) speedup when using 4 cores. The speedup ratios are
comparable between the two cores, and are higher for larger Latency (ms)
models where the overhead of multi-threading occupies a
smaller fraction of the total computation. Figure 4.3: Face attribute classifier on Qualcomm Snap-
dragon 821: Latency-vs-accuracy tradeoff of floating-point
and integer-only MobileNets.
4.2.4 Face attributes
Figure 4.3 shows the latency-vs-accuracy tradeoff of face Ablation study To understand performance sensitivity
attribute classification on the Qualcomm Snapdragon 821. to the quantization scheme, we further evaluate quantized
training with varying weight and activation quantization bit depths. The degradation in average precision for binary attributes and age precision relative to the floating-point baseline is shown in Tables 4.7 and 4.8, respectively. The tables suggest that 1) weights are more sensitive to reduced quantization bit depth than activations, 2) 8 and 7-bit quantized models perform similarly to floating point models, and 3) when the total bit-depths are equal, it is better to keep weight and activation bit depths the same.

wt. \ act.   8         7         6         5         4
8            -0.9%     -0.3%     -0.4%     -1.3%     -3.5%
7            -1.3%     -0.5%     -1.2%     -1.0%     -2.6%
6            -1.1%     -1.2%     -1.6%     -1.6%     -3.1%
5            -3.1%     -3.7%     -3.4%     -3.4%     -4.8%
4            -11.4%    -13.6%    -10.8%    -13.1%    -14.0%

Table 4.7: Face attributes: relative average category precision of integer-quantized MobileNets (varying weight and activation bit depths) compared with floating point. Rows give the weight bit depth (wt.) and columns the activation bit depth (act.).

wt. \ act.   8         7         6         5         4
8            -1.3%     -1.6%     -3.2%     -6.0%     -9.8%
7            -1.8%     -1.2%     -4.6%     -7.0%     -9.9%
6            -2.1%     -4.9%     -2.6%     -7.3%     -9.6%
5            -3.1%     -6.1%     -7.8%     -4.4%     -10.0%
4            -10.6%    -20.8%    -17.9%    -19.0%    -19.5%

Table 4.8: Face attributes: Age precision at difference of 5 years for quantized model (varying weight and activation bit depths) compared with floating point. Rows give the weight bit depth (wt.) and columns the activation bit depth (act.).

5. Discussion

We propose a quantization scheme that relies only on integer arithmetic to approximate the floating-point computations in a neural network. Training that simulates the effect of quantization helps to restore model accuracy to near-identical levels as the original. In addition to the 4× reduction of model size, inference efficiency is improved via ARM NEON-based implementations. The improvement advances the state-of-the-art tradeoff between latency on common ARM CPUs and the accuracy of popular computer vision models. The synergy between our quantization scheme and efficient architecture design suggests that integer-arithmetic-only inference could be a key enabler that propels visual recognition technologies into the real-time and low-end phone market.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org, 1, 2015.
[2] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[4] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[5] Google. TensorFlow Lite. https://www.tensorflow.org/mobile/tflite.
[6] G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org.
[7] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737–1746, 2015.
[8] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[12] J. Huang, V. Rathod, D. Chow, C. Sun, and M. Zhu. Tensorflow object detection api, 2017.
[13] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
[14] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
[15] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
[16] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. arXiv preprint arXiv:1602.07360, 2016.
[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456. JMLR.org, 2015.
[18] B. Jacob, P. Warden, et al. gemmlowp: a small self-contained low-precision gemm library. https://github.com/google/gemmlowp.
[19] S. Kligys, S. Sivakumar, et al. Tensorflow quantized training support. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[21] C. Leng, H. Li, S. Zhu, and R. Jin. Extremely low bit neural network: Squeeze the last bit out with admm. arXiv preprint arXiv:1707.09870, 2017.
[22] F. Li, B. Zhang, and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. Ssd: Single shot multibox detector. arXiv preprint arXiv:1512.02325, 2015.
[26] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.
[27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[30] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[31] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011.
[32] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.
[33] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
[34] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[35] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
A. Appendix: Layer-specific details

A.1. Mathematical functions

Math functions such as hyperbolic tangent, the logistic function, and softmax often appear in neural networks. No lookup tables are needed since these functions are implemented in pure fixed-point arithmetic similarly to how they would be implemented in floating-point arithmetic⁷.

A.2. Addition

Some neural networks use a plain Addition layer type, that simply adds two activation arrays together. Such Addition layers are more expensive in quantized inference compared to floating-point because rescaling is needed: one input needs to be rescaled onto the other's scale using a fixed-point multiplication by the multiplier $M = S_1/S_2$ similar to what we have seen earlier (end of section 2.2), before the actual addition can be performed as a simple integer addition; finally, the result must be rescaled again to fit the output array's scale⁸.

A.3. Concatenation

Fully general support for concatenation layers poses the same rescaling problem as Addition layers. Because such rescaling of uint8 values would be a lossy operation, and as it seems that concatenation ought to be a lossless operation, we prefer to handle this problem differently: instead of implementing lossy rescaling, we introduce a requirement that all the input activations and the output activations in a Concatenation layer have the same quantization parameters. This removes the need for rescaling and concatenations are thus lossless and free of any arithmetic⁹.

B. Appendix: ARM NEON details

This section assumes familiarity with assembly programming on the ARM NEON instruction set. The instruction mnemonics below refer to the 64-bit ARM instruction set, but the discussion applies equally to 32-bit ARM instructions.

The fixed-point multiplications referenced throughout this article map exactly to the SQRDMULH instruction. It is very important to use the correctly-rounding instruction SQRDMULH and not SQDMULH¹⁰.

The rounding-to-nearest right-shifts referenced in section 2.2 do not map exactly to any ARM NEON instruction. The problem is that the "rounding right shift" instruction, RSHL with variable negative offset, breaks ties by rounding upward, instead of rounding them away from zero. For example, if we use RSHL to implement the division $-12/2^3$, the result will be −1 whereas it should be −2 with "round to nearest". This is problematic as it results in an overall upward bias, which has been observed to cause significant loss of end-to-end accuracy in neural network inference. A correct round-to-nearest right-shift can still be implemented using RSHL but with suitable fix-up arithmetic around it¹¹.

For efficient NEON implementation of the matrix multiplication's core accumulation, we use the following trick. In the multiply-add operation in (10), we first change the operands' type from uint8 to int8 (which can be done by subtracting 128 from the quantized values and zero-points). Thus the core multiply-add becomes

int32 += int8 * int8.    (B.1)

As mentioned in section 3, with a minor tweak of the quantized training process, we can ensure that the weights, once quantized as int8 values, never take the value −128. Hence, the product in (B.1) is never −128 ∗ −128, and is therefore always less than $2^{14}$ in absolute value. Hence, (B.1) can accumulate two products on a local int16 accumulator before that needs to be accumulated into the true int32 accumulator. This allows the use of an 8-way SIMD multiplication (SMULL on int8 operands), followed by an 8-way SIMD multiply-add (SMLAL on int8 operands), followed by a pairwise-add-and-accumulate into the int32 accumulators (SADALP)¹².

C. Appendix: Graph diagrams

D. Experimental protocols

D.1. ResNet protocol

Preprocessing. All images from ImageNet [3] are resized preserving aspect ratio so that the smallest side of the image is 256. Then the center 224 × 224 patch is cropped and the means are subtracted for each of the RGB channels.

Optimization. We use the momentum optimizer from TensorFlow [1] with momentum 0.9 and a batch size of 32. The learning rate starts from $10^{-5}$ and decays in a staircase fashion by 0.1 for every 30 epochs. Activation quantization is delayed for 500,000 steps for reasons discussed in section 3. Training uses 50 workers asynchronously, and stops after validation accuracy plateaus, normally after 100 epochs.

⁷ Pure-arithmetic, SIMD-ready, branch-free, fixed-point implementations of at least tanh and the logistic functions are given in gemmlowp [18]'s fixedpoint directory, with specializations for NEON and SSE instruction sets. One can see in TensorFlow Lite [5] how these are called.
⁸ See the TensorFlow Lite [5] implementation.
⁹ This is implemented in this part of the TensorFlow Lite [5] Converter.
¹⁰ The fixed-point math function implementations in gemmlowp [18] use such fixed-point multiplications, and ordinary (non-saturating) integer additions. We have no use for general saturated arithmetic.
¹¹ It is implemented here in gemmlowp [18].
¹² This technique is implemented in the optimized NEON kernel in gemmlowp [18], which is in particular what TensorFlow Lite uses (see the choice of L8R8WithLhsNonzeroBitDepthParams at this line).
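The rounding behaviour discussed in Appendix B can be illustrated with a small Python model. This is only a sketch under stated assumptions: the function names `sqrdmulh`, `rshl_style_shift` and `round_to_nearest_shift` are hypothetical, unbounded Python integers stand in for 32-bit lanes, and the mask-and-threshold fix-up is one plausible way to obtain ties-away-from-zero rounding rather than a copy of the gemmlowp [18] code. The last lines check the int16 accumulation bound behind (B.1).

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def sqrdmulh(a: int, b: int) -> int:
    """Model of one int32 lane of SQRDMULH (assumed semantics):
    (2*a*b + 2^31) >> 32, saturated to the int32 range."""
    result = (2 * a * b + (1 << 31)) >> 32
    return max(INT32_MIN, min(INT32_MAX, result))


def rshl_style_shift(x: int, exponent: int) -> int:
    """Rounding right shift that breaks ties upward (exponent >= 1)."""
    return (x + (1 << (exponent - 1))) >> exponent


def round_to_nearest_shift(x: int, exponent: int) -> int:
    """Right shift that rounds to nearest, breaking ties away from zero."""
    mask = (1 << exponent) - 1
    remainder = x & mask                      # low bits dropped by the shift
    threshold = (mask >> 1) + (1 if x < 0 else 0)
    return (x >> exponent) + (1 if remainder > threshold else 0)


# Bound behind (B.1): weights never equal -128, so |int8 * int8| <= 127 * 128.
MAX_PRODUCT = 127 * 128
assert MAX_PRODUCT < 2 ** 14
assert 2 * MAX_PRODUCT <= 2 ** 15 - 1         # two products fit in an int16

if __name__ == "__main__":
    print(rshl_style_shift(-12, 3))           # ties upward: -1
    print(round_to_nearest_shift(-12, 3))     # ties away from zero: -2
    # Multiplying by 0.75 expressed as a Q31 multiplier (3 * 2^29).
    print(sqrdmulh(1000, 3 << 29))            # 750
```

The two shift functions differ only on exact ties with a negative input, which is the case the text identifies as the source of the upward bias.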
[Figure C.1 diagram nodes: input, weights, conv, biases, +, ReLU6, output.]

Figure C.1: Simple graph: original

[Figure C.2 diagram nodes: input, weights, wt quant, conv, biases, +, ReLU6, act quant, output.]

Figure C.2: Simple graph: quantized

[Figure C.3 diagram nodes: input, weights, conv, biases, +, +, ReLU6, output.]

Figure C.3: Layer with a bypass connection: original

D.2. Inception protocol

All results in table 4.3 were obtained after training for approximately 10 million steps, with batches of 32 samples, using 50 distributed workers, asynchronously. Training data were ImageNet 2012 299 × 299 images with labels. Image augmentation consisted of: random crops, random horizontal flips, and random color distortion. The optimizer used was RMSProp with learning rate starting at 0.045 and decaying exponentially and stepwise with factor 0.94 after every 2 epochs. Other RMSProp parameters were: 0.9 momentum, 0.9 decay, 1.0 epsilon term. Trained parameters were EMA averaged with decay 0.9999.

D.3. COCO detection protocol

Preprocessing. During training, all images are randomly cropped and resized to 320 × 320. During evaluation, all images are directly resized to 320 × 320. All input values are normalized to [−1, 1].

Optimization. We used the RMSprop optimizer from TensorFlow [1] with a batch size of 32. The learning rate starts from $4 \times 10^{-3}$ and decays in a staircase fashion by a factor of 0.1 for every 100 epochs. Activation quantization is delayed for 500,000 steps for reasons discussed in section 3. Training uses 20 workers asynchronously, and stops after validation accuracy plateaus, normally after approximately 6 million steps.

Metrics. Evaluation results are reported with the COCO primary challenge metric: AP at IoU=.50:.05:.95. We follow the same train/eval split in [13].

D.4. Face detection and face attribute classification protocol

Preprocessing. Random 1:1 crops are taken from images in the Flickr-based dataset used in [10] and resized to 320 × 320 pixels for face detection and 128 × 128 pixels for face attribute classification. The resulting crops are flipped horizontally with a 50% probability. The values for each of the RGB channels are renormalized to be in the range [−1, 1].
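The flip and renormalization steps of the D.4 preprocessing can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the helper names are hypothetical, the input is assumed to hold 8-bit channel values in [0, 255], the linear mapping x / 127.5 − 1 is an assumption (the text only states the target range), and the random crop and resize steps are omitted.

```python
import random


def normalize_pixel(value):
    """Map an 8-bit channel value in [0, 255] linearly onto [-1, 1]
    (assumed mapping; the protocol only states the target range)."""
    return value / 127.5 - 1.0


def hflip(image):
    """Mirror an image given as rows of (R, G, B) tuples."""
    return [list(reversed(row)) for row in image]


def preprocess(image, rng):
    """Flip horizontally with probability 0.5, then normalize each channel."""
    if rng.random() < 0.5:
        image = hflip(image)
    return [[tuple(normalize_pixel(c) for c in px) for px in row]
            for row in image]


if __name__ == "__main__":
    tiny = [[(0, 255, 51), (255, 0, 0)]]      # one row, two pixels
    print(preprocess(tiny, random.Random(0)))
```

Passing the random generator explicitly keeps the augmentation reproducible when a seed is fixed.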
Face Detection Optimization. We used the RMSprop optimizer from TensorFlow [1] with a batch size of 32. The learning rate starts from $4 \times 10^{-3}$ and decays in a staircase fashion by a factor of 0.1 for every 100 epochs. Activation quantization is delayed for 500,000 steps for reasons discussed in section 3. Training uses 20 workers asynchronously, and stops after validation accuracy plateaus, normally after approximately 3 million steps.

Face Attribute Classification Optimization. We followed the optimization protocol in [10]. We used the Adagrad optimizer from TensorFlow [1] with a batch size of 32 and a constant learning rate of 0.1. Training uses 12 workers asynchronously, and stops at 20 million steps.

Latency Measurements. We created a binary that runs the face detection and face attributes classification models repeatedly on random inputs for 100 seconds. We pushed this binary to Pixel and Pixel 2 phones using the adb push command, and executed it on 1, 2, and 4 LITTLE cores, and 1, 2, and 4 big cores using the adb shell command with the appropriate taskset specified. We reported the average runtime of the face detector model on 320 × 320 inputs, and of the face attributes classifier model on 128 × 128 inputs.

[Figure C.4 diagram nodes: input, weights, wt quant, conv, biases, +, quant, +, ReLU6, act quant, output.]

Figure C.4: Layer with a bypass connection: quantized

[Figure C.5 diagram nodes: input, weights, conv, moments, MA µ, σ, γ, β, γ(x − µ)/σ + β, ReLU6, output.]

Figure C.5: Convolutional layer with batch normalization: training graph

[Figure C.6 diagram nodes: input, wγ/σ, conv, β − γµ/σ, +, ReLU6, output.]

Figure C.6: Convolutional layer with batch normalization: inference graph
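The relationship between Figures C.5 and C.6 can be checked numerically: applying batch normalization γ(x − µ)/σ + β after a convolution gives the same result as a convolution with folded weights wγ/σ followed by adding the folded bias β − γµ/σ. The sketch below is only illustrative. A plain dot product stands in for the convolution, all function names are hypothetical, and σ is assumed to be the final standard deviation term (any epsilon already included).

```python
def conv(x, w):
    """Stand-in for a convolution: a dot product over input values."""
    return sum(xi * wi for xi, wi in zip(x, w))


def batch_norm(y, gamma, beta, mu, sigma):
    """Batch normalization as labelled in Figure C.5."""
    return gamma * (y - mu) / sigma + beta


def fold_batch_norm(w, gamma, beta, mu, sigma):
    """Folded weights and bias as labelled in Figure C.6."""
    w_fold = [wi * gamma / sigma for wi in w]
    b_fold = beta - gamma * mu / sigma
    return w_fold, b_fold


if __name__ == "__main__":
    x = [1.0, 2.0, -3.0]
    w = [0.5, -0.25, 0.125]
    gamma, beta, mu, sigma = 2.0, 0.1, 0.3, 1.5

    unfolded = batch_norm(conv(x, w), gamma, beta, mu, sigma)
    w_fold, b_fold = fold_batch_norm(w, gamma, beta, mu, sigma)
    folded = conv(x, w_fold) + b_fold
    print(unfolded, folded)  # the two values agree up to rounding error
```

Because the folded form needs only one convolution and one bias addition, it is the form that can be quantized as a single layer at inference time.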
[Figure C.7 diagram nodes: input, weights, conv, moments, MA µ, σ, γ, β, wγ/σ, β − γµ/σ, conv fold, +, ReLU6, output.]

Figure C.7: Convolutional layer with batch normalization: training graph, folded

[Figure C.8 diagram nodes: input, weights, conv, moments, MA µ, σ, γ, β, wγ/σ, β − γµ/σ, wt quant, conv fold, +, ReLU6, act quant, output.]

Figure C.8: Convolutional layer with batch normalization: training graph, folded and quantized
