Quantization and Training of Neural Networks For Efficient Integer-Arithmetic-Only Inference
{benoitjacob,skligys,bochen,menglong,
mttang,howarda,hadam,dkalenichenko}@google.com
Google Inc.
[Figure 1.1 panels: (a) integer-arithmetic-only inference graph (uint8 inputs/weights, uint32 biases and accumulators, conv, bias add, ReLU6); (b) training graph with simulated quantization (wt quant and act quant nodes); (c) latency-vs-accuracy plot: Top-1 accuracy vs. latency (ms), float vs. 8-bit.]
Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output
are represented as 8-bit integers according to equation 1. The convolution involves 8-bit integer operands and a 32-bit integer accumulator.
The bias addition involves only 32-bit integers (section 2.4). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training
with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic.
Weight quantization (“wt quant”) and activation quantization (“act quant”) nodes are injected into the computation graph to simulate the
effects of quantization of the variables (section 3). The resultant graph approximates the integer-arithmetic-only computation graph in panel
a), while being trainable using conventional optimization algorithms for floating point models. c) Our quantization scheme benefits from
the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section 4). The figure compares
integer quantized MobileNets [10] against floating point baselines on ImageNet [3] using Qualcomm Snapdragon 835 LITTLE cores.
tions [14, 27, 34]. With these approaches, both multiplications and additions can be implemented by efficient bit-shift and bit-count operations, which are showcased in custom GPU kernels (BNN [14]). However, 1 bit quantization often leads to substantial performance degradation, and may be overly stringent on model representation.

Our work draws inspiration from [7], which leverages low-precision fixed-point arithmetic to accelerate the training speed of CNNs, and from [31], which uses 8-bit fixed-point arithmetic to speed up inference on x86 CPUs. Our quantization scheme focuses instead on improving the inference speed vs accuracy tradeoff on mobile CPUs.
In this paper we address the above issues by improving the latency-vs-accuracy tradeoffs of MobileNets on common mobile hardware. Our specific contributions are:

• We provide a quantization scheme (section 2.1) that quantizes both weights and activations as 8-bit integers, and just a few parameters (bias vectors) as 32-bit integers.

• We provide a quantized inference framework that is efficiently implementable on integer-arithmetic-only hardware such as the Qualcomm Hexagon (sections 2.2, 2.3), and we describe an efficient, accurate implementation on ARM NEON (Appendix B).

• We provide a quantized training framework (section 3) co-designed with our quantized inference to minimize the loss of accuracy from quantization on real models.

• We apply our frameworks to efficient classification and detection systems based on MobileNets and provide benchmark results on popular ARM CPUs (section 4) that show significant improvements in the latency-vs-accuracy tradeoffs for state-of-the-art MobileNet architectures, demonstrated in ImageNet classification [3], COCO object detection [23], and other tasks.

2. Quantized Inference

2.1. Quantization scheme

In this section, we describe our general quantization scheme¹ ², that is, the correspondence between the bit-representation of values (denoted q below, for “quantized value”) and their interpretation as mathematical real numbers (denoted r below, for “real value”). Our quantization scheme is implemented using integer-only arithmetic during inference and floating-point arithmetic during training, with both implementations maintaining a high degree of correspondence with each other. We achieve this by first providing a mathematically rigorous definition of our quantization scheme, and separately adopting this scheme for both integer-arithmetic inference and floating-point training.

¹ The quantization scheme described here is the one adopted in TensorFlow Lite [5] and we will refer to specific parts of its code to illustrate aspects discussed below.
² We had earlier described this quantization scheme in the documentation of gemmlowp [18]. That page may still be useful as an alternate treatment of some of the topics developed in this section, and for its self-contained example code.
A basic requirement of our quantization scheme is that it permits efficient implementation of all arithmetic using only integer arithmetic operations on the quantized values (we eschew implementations requiring lookup tables because these tend to perform poorly compared to pure arithmetic on SIMD hardware). This is equivalent to requiring that the quantization scheme be an affine mapping of integers q to real numbers r, i.e. of the form

r = S(q − Z)    (1)

for some constants S and Z. Equation (1) is our quantization scheme and the constants S and Z are our quantization parameters. Our quantization scheme uses a single set of quantization parameters for all values within each activations array and within each weights array; separate arrays use separate quantization parameters.

For 8-bit quantization, q is quantized as an 8-bit integer (for B-bit quantization, q is quantized as a B-bit integer). Some arrays, typically bias vectors, are quantized as 32-bit integers, see section 2.4.

The constant S (for “scale”) is an arbitrary positive real number. It is typically represented in software as a floating-point quantity, like the real values r. Section 2.2 describes methods for avoiding the representation of such floating-point quantities in the inference workload.

The constant Z (for “zero-point”) is of the same type as quantized values q, and is in fact the quantized value q corresponding to the real value 0. This allows us to automatically meet the requirement that the real value r = 0 be exactly representable by a quantized value. The motivation for this requirement is that efficient implementation of neural network operators often requires zero-padding of arrays around boundaries.

Our discussion so far is summarized in the following quantized buffer data structure³, with one instance of such a buffer existing for each activations array and weights array in a neural network. We use C++ syntax because it allows the unambiguous conveyance of types.

template<typename QType>   // e.g. QType=uint8
struct QuantizedBuffer {
  vector<QType> q;         // the quantized values
  float S;                 // the scale
  QType Z;                 // the zero-point
};

2.2. Integer-arithmetic-only matrix multiplication

We now turn to the question of how to perform inference using only integer arithmetic, i.e. how to use Equation (1) to translate real-numbers computation into quantized-values computation, and how the latter can be designed to involve only integer arithmetic even though the scale values S are not integers.

Consider the multiplication of two square N × N matrices of real numbers, r_1 and r_2, with their product represented by r_3 = r_1 r_2. We denote the entries of each of these matrices r_α (α = 1, 2 or 3) as r_α^{(i,j)} for 1 ≤ i, j ≤ N, and the quantization parameters with which they are quantized as (S_α, Z_α). We denote the quantized entries by q_α^{(i,j)}. Equation (1) then becomes:

r_α^{(i,j)} = S_α (q_α^{(i,j)} − Z_α).    (2)

From the definition of matrix multiplication, we have

S_3 (q_3^{(i,k)} − Z_3) = \sum_{j=1}^{N} S_1 (q_1^{(i,j)} − Z_1) S_2 (q_2^{(j,k)} − Z_2),    (3)

which can be rewritten as

q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} (q_1^{(i,j)} − Z_1)(q_2^{(j,k)} − Z_2),    (4)

where the multiplier M is defined as

M := S_1 S_2 / S_3.    (5)

In Equation (4), the only non-integer is the multiplier M. As a constant depending only on the quantization scales S_1, S_2, S_3, it can be computed offline. We empirically find it to always be in the interval (0, 1), and can therefore express it in the normalized form

M = 2^{−n} M_0    (6)

where M_0 is in the interval [0.5, 1) and n is a non-negative integer. The normalized multiplier M_0 now lends itself well to being expressed as a fixed-point multiplier (e.g. int16 or int32 depending on hardware capability). For example, if int32 is used, the integer representing M_0 is the int32 value nearest to 2^{31} M_0. Since M_0 ≥ 0.5, this value is always at least 2^{30} and will therefore always have at least 30 bits of relative accuracy. Multiplication by M_0 can thus be implemented as a fixed-point multiplication⁴. Meanwhile, multiplication by 2^{−n} can be implemented with an efficient bit-shift, albeit one that needs to have correct round-to-nearest behavior, an issue that we return to in Appendix B.

2.3. Efficient handling of zero-points

In order to efficiently implement the evaluation of Equation (4) without having to perform 2N^3 subtractions and

³ The actual data structures in the TensorFlow Lite [5] Converter are QuantizationParams and Array in this header file. As we discuss in the next subsection, this data structure, which still contains a floating-point quantity, does not appear in the actual quantized on-device inference for e.g. a Convolutional operator (reference code is self-contained, optimized code calls into gemmlowp [18]).
⁴ The computation discussed in this section is implemented in TensorFlow Lite [5].
⁶ The quantization of bias-vectors discussed here is implemented in the TensorFlow Lite [5] Converter.
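To make the arithmetic of section 2.2 concrete, the following is a minimal NumPy sketch of equations (4)–(6): the multiplier M = S_1 S_2 / S_3 is normalized offline into an int32 fixed-point multiplier and a right shift, and products are accumulated in 32-bit integers. The function names and the simplified rounding are our own illustrative choices; the production kernels in gemmlowp [18] and TensorFlow Lite [5] are organized differently (e.g. they use exact round-to-nearest shifts and the zero-point optimizations of section 2.3).

import numpy as np

def quantize_multiplier(M):
    # Normalize M = S1*S2/S3 (empirically in (0, 1)) as M = 2**-n * M0 with
    # M0 in [0.5, 1), and represent M0 by the int32 value nearest to 2**31 * M0.
    assert 0.0 < M < 1.0
    n = 0
    while M < 0.5:
        M *= 2.0
        n += 1
    return int(round(M * (1 << 31))), n

def rounding_rshift(x, shift):
    # Right shift with (simplified) round-to-nearest behavior.
    return x if shift == 0 else (x + (1 << (shift - 1))) >> shift

def quantized_matmul(q1, Z1, q2, Z2, Z3, S1, S2, S3):
    # Equation (4): accumulate (q1 - Z1)(q2 - Z2) in int32, then scale by
    # M = S1*S2/S3 as a fixed-point multiply (by M0) plus a bit-shift (by n).
    acc = (q1.astype(np.int32) - Z1) @ (q2.astype(np.int32) - Z2)
    q_M0, n = quantize_multiplier(S1 * S2 / S3)
    scaled = rounding_rshift(acc.astype(np.int64) * q_M0, 31)  # multiply by M0
    scaled = rounding_rshift(scaled, n)                        # multiply by 2**-n
    return np.clip(scaled + Z3, 0, 255).astype(np.uint8)       # uint8 output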
3. Training with simulated quantization

A common approach to training quantized networks is to train in floating point and then quantize the resulting weights (sometimes with additional post-quantization training for fine-tuning). We found that this approach works sufficiently well for large models with considerable representational capacity, but leads to significant accuracy drops for small models. Common failure modes for simple post-training quantization include: 1) large differences (more than 100×) in ranges of weights for different output channels (section 2 mandates that all channels of the same layer be quantized to the same resolution, which causes weights in channels with smaller ranges to have much higher relative error) and 2) outlier weight values that make all remaining weights less precise after quantization.

We propose an approach that simulates quantization effects in the forward pass of training. Backpropagation still happens as usual, and all weights and biases are stored in floating point so that they can be easily nudged by small amounts. The forward propagation pass however simulates quantized inference as it will happen in the inference engine, by implementing in floating-point arithmetic the rounding behavior of the quantization scheme that we introduced in section 2:

• Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer, the batch normalization parameters are “folded into” the weights before quantization, see section 3.2.

• Activations are quantized at points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer’s output, or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.

For each layer, quantization is parameterized by the number of quantization levels and clamping range, and is performed by applying point-wise the quantization function q defined as follows:

clamp(r; a, b) := min(max(r, a), b)
s(a, b, n) := (b − a) / (n − 1)
q(r; a, b, n) := ⌊(clamp(r; a, b) − a) / s(a, b, n)⌉ s(a, b, n) + a,    (12)

where r is a real-valued number to be quantized, [a; b] is the quantization range, n is the number of quantization levels, and ⌊·⌉ denotes rounding to the nearest integer. n is fixed for all layers in our experiments, e.g. n = 2^8 = 256 for 8-bit quantization.

3.1. Learning quantization ranges

Quantization ranges are treated differently for weight quantization vs. activation quantization:

• For weights, the basic idea is simply to set a := min w, b := max w. We apply a minor tweak to this so that the weights, once quantized as int8 values, only range in [−127, 127] and never take the value −128, as this enables a substantial optimization opportunity (for more details, see Appendix B).

• For activations, ranges depend on the inputs to the network. To estimate the ranges, we collect [a; b] ranges seen on activations during training and then aggregate them via exponential moving averages (EMA) with the smoothing parameter being close to 1 so that observed ranges are smoothed across thousands of training steps. Given significant delay in the EMA updating activation ranges when the ranges shift rapidly, we found it useful to completely disable activation quantization at the start of training (say, for 50 thousand to 2 million steps). This allows the network to enter a more stable state where activation quantization ranges do not exclude a significant fraction of values.

In both cases, the boundaries [a; b] are nudged so that value 0.0 is exactly representable as an integer z(a, b, n) after quantization. As a result, the learned quantization parameters map to the scale S and zero-point Z in equation 1:

S = s(a, b, n),  Z = z(a, b, n)    (13)

Below we depict simulated quantization assuming that the computations of a neural network are captured as a TensorFlow graph [1]. A typical workflow is described in Algorithm 1. Optimization of the inference graph by fusing and removing operations is outside the scope of this paper. Source code for graph modifications (inserting fake quantization operations, creating and optimizing the inference graph) and a low bit inference engine has been open-sourced with TensorFlow contributions in [19].

Algorithm 1 Quantized graph training and inference
1: Create a training graph of the floating-point model.
2: Insert fake quantization TensorFlow operations in locations where tensors will be downcasted to fewer bits during inference according to equation 12.
3: Train in simulated quantized mode until convergence.
4: Create and optimize the inference graph for running in a low bit inference engine.
5: Run inference using the quantized inference graph.

Figure 1.1a and b illustrate TensorFlow graphs before and after quantization for a simple convolutional layer. Illustrations of the more complex convolution with a bypass connection in figure C.3 can be found in figure C.4.
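As an illustration, here is a minimal NumPy sketch of the point-wise function q(r; a, b, n) of equation 12, together with the nudging of [a; b] so that 0.0 maps exactly to the integer zero-point z(a, b, n). The function and variable names are ours, not those of the open-sourced tools in [19], which insert equivalent fake quantization operations directly into the TensorFlow graph.

import numpy as np

def nudge_range(a, b, n):
    # Choose scale s and integer zero-point z such that r = s * (q - z) and
    # r = 0.0 is exactly representable; [a, b] is adjusted accordingly.
    assert b > a
    s = (b - a) / (n - 1)
    z = int(round(-a / s))
    z = min(max(z, 0), n - 1)       # clamp zero-point to a representable value
    return s, z, -z * s, (n - 1 - z) * s

def fake_quant(r, a, b, n=256):
    # Point-wise q(r; a, b, n) from equation 12: clamp to the (nudged) range,
    # round to one of the n quantization levels, and map back to a real value.
    s, _, a, b = nudge_range(a, b, n)
    return np.round((np.clip(r, a, b) - a) / s) * s + a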
Note that the biases are not quantized because they are represented as 32-bit integers in the inference process, with a much higher range and precision compared to the 8-bit weights and activations. Furthermore, quantization parameters used for biases are inferred from the quantization parameters of the weights and activations. See section 2.4.

Typical TensorFlow code illustrating use of [19] follows:

from tf.contrib.quantize import quantize_graph as qg

g = tf.Graph()
with g.as_default():
    output = ...
    total_loss = ...
    optimizer = ...
    train_tensor = ...

if is_training:
    quantized_graph = qg.create_training_graph(g)
else:
    quantized_graph = qg.create_eval_graph(g)

# Train or evaluate quantized_graph.

3.2. Batch normalization folding

For models that use batch normalization (see [17]), there is additional complexity: the training graph contains batch normalization as a separate block of operations, whereas the inference graph has batch normalization parameters “folded” into the convolutional or fully connected layer’s weights and biases, for efficiency. To accurately simulate quantization effects, we need to simulate this folding, and quantize weights after they have been scaled by the batch normalization parameters. We do so with the following:

w_fold := γ w / √(EMA(σ_B²) + ε).    (14)

Here γ is the batch normalization’s scale parameter, EMA(σ_B²) is the moving average estimate of the variance of convolution results across the batch, and ε is just a small constant for numerical stability.

After folding, the batch-normalized convolutional layer reduces to the simple convolutional layer depicted in figure 1.1a with the folded weights w_fold and the corresponding folded biases. Therefore the same recipe in figure 1.1b applies. See the appendix for the training graph (figure C.5) for a batch-normalized convolutional layer, the corresponding inference graph (figure C.6), the training graph after batch-norm folding (figure C.7) and the training graph after both folding and quantization (figure C.8).
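For concreteness, the following is a minimal NumPy sketch of the folding in equation (14), together with the corresponding folded bias β − γ EMA(µ_B) / √(EMA(σ_B²) + ε) (cf. figure C.7). The names and the assumption that output channels sit on the last weight axis are illustrative; the actual graph rewriting is handled by the open-sourced tools in [19].

import numpy as np

def fold_batch_norm(w, gamma, beta, ema_mean, ema_var, eps=1e-3):
    # w: convolution weights with output channels on the last axis (an
    # assumption made here so that per-channel broadcasting is trivial).
    inv_sigma = 1.0 / np.sqrt(ema_var + eps)          # 1 / sqrt(EMA(sigma_B^2) + eps)
    w_fold = w * (gamma * inv_sigma)                  # equation (14)
    bias_fold = beta - gamma * ema_mean * inv_sigma   # folded bias (cf. figure C.7)
    return w_fold, bias_fold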
4. Experiments

We conducted two sets of experiments, one showcasing the effectiveness of quantized training (Section 4.1), and the other illustrating the improved latency-vs-accuracy tradeoff of quantized models on common hardware (Section 4.2). The most performance-critical part of the inference workload on the neural networks being benchmarked is matrix multiplication (GEMM). The inference code uses the gemmlowp library [18] for 8-bit quantized GEMM, and the Eigen library [6] for 32-bit floating-point GEMM.

4.1. Quantized training of Large Networks

We apply quantized training to ResNets [9] and InceptionV3 [30] on the ImageNet dataset. These popular networks are too computationally intensive to be deployed on mobile devices, but are included for comparison purposes. Training protocols are discussed in Appendix D.1 and D.2.

4.1.1 ResNets

We compare floating-point vs integer-quantized ResNets for various depths in table 4.1. Accuracies of integer-only quantized networks are within 2% of their floating-point counterparts.

ResNet depth                  50      100     150
Floating-point accuracy       76.4%   78.0%   78.8%
Integer-quantized accuracy    74.9%   76.6%   76.7%

Table 4.1: ResNet on ImageNet: Floating-point vs quantized network accuracy for various network depths.

We also list ResNet50 accuracies under different quantization schemes in table 4.2. As expected, integer-only quantization outperforms FGQ [26], which uses 2 bits for weight quantization. INQ [33] (5-bit weight, floating-point activation) achieves a similar accuracy as ours, but we provide additional run-time improvements (see section 4.2).

Scheme           BWN      TWN      INQ      FGQ    Ours
Weight bits      1        2        5        2      8
Activation bits  float32  float32  float32  8      8
Accuracy         68.7%    72.5%    74.8%    70.8%  74.9%

Table 4.2: ResNet on ImageNet: Accuracy under various quantization schemes, including binary weight networks (BWN [21, 15]), ternary weight networks (TWN [21, 22]), incremental network quantization (INQ [33]) and fine-grained quantization (FGQ [26]).

4.1.2 Inception v3 on ImageNet

We additionally probe the sensitivity of activation quantization by comparing networks with two activation nonlinearities, ReLU6 and ReLU. The training protocol is in Appendix D.2.

Table 4.3 shows that 7-bit quantized training produces model accuracies close to that of 8-bit quantized training, and quantized models with ReLU6 have less accuracy degradation. The latter can be explained by noticing that ReLU6 introduces the interval [0, 6] as a natural range for activations, while ReLU allows activations to take values from a possibly larger interval, with different ranges in different channels. Values in a fixed range are easier to quantize with high precision.

Act. type          accuracy             recall-5
                   mean     std. dev.   mean     std. dev.
ReLU6   floats     78.4%    0.1%        94.1%    0.1%
        8 bits     75.4%    0.1%        92.5%    0.1%
        7 bits     75.0%    0.3%        92.4%    0.2%
ReLU    floats     78.3%    0.1%        94.2%    0.1%
        8 bits     74.2%    0.2%        92.2%    0.1%
        7 bits     73.7%    0.3%        92.0%    0.1%

Table 4.3: Inception v3 on ImageNet: Accuracy and recall-5 comparison of floating point and quantized models.

Figure 4.1: ImageNet classifier on Qualcomm Snapdragon 835 big cores: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets. [Plot: Top-1 accuracy vs. latency (ms), float vs. 8-bit.]

Figure 4.2: ImageNet classifier on Qualcomm Snapdragon 821: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets. [Plot: Top-1 accuracy vs. latency (ms), float vs. 8-bit.]
4.2. Quantization of MobileNets

MobileNets are a family of architectures that achieve a state-of-the-art tradeoff between on-device latency and ImageNet classification accuracy. In this section we demonstrate how integer-only quantization can further improve the tradeoff on common hardware.

4.2.1 ImageNet

We benchmarked the MobileNet architecture with varying depth-multipliers (DM) and resolutions on ImageNet on three types of Qualcomm cores, which represent three different micro-architectures: 1) Snapdragon 835 LITTLE core (figure 1.1c), a power-efficient processor found in Google Pixel 2; 2) Snapdragon 835 big core (figure 4.1), a high-performance core employed by Google Pixel 2; and 3) Snapdragon 821 big core (figure 4.2), a high-performance core used in Google Pixel 1.

Integer-only quantized MobileNets achieve higher accuracies than floating-point MobileNets given the same runtime budget. The accuracy gap is quite substantial (∼10%) for Snapdragon 835 LITTLE cores at the 33ms latency needed for real-time (30 fps) operation. While most of the quantization literature focuses on minimizing accuracy loss for a given architecture, we advocate for a more comprehensive latency-vs-accuracy tradeoff as a better measure. Note that this tradeoff depends critically on the relative speed of floating-point vs integer-only arithmetic in hardware. Floating-point computation is better optimized in the Snapdragon 821, for example, resulting in a less noticeable reduction in latency for quantized models.
4.2.2 COCO

We evaluated quantization in the context of mobile real-time object detection, comparing the performance of quantized 8-bit and float models of MobileNet SSD [10, 25] on the COCO dataset [24]. We replaced all the regular convolutions in the SSD prediction layers with separable convolutions (depthwise followed by 1 × 1 projection). This modification is consistent with the overall design of MobileNets and makes them more computationally efficient. We utilized the Open Source TensorFlow Object Detection API [12] to train and evaluate our models. The training protocol is described in Appendix D.3. We also delayed quantization for 500 thousand steps (see section 3.1), finding that it significantly decreases the time to convergence.

Table 4.4 shows the latency-vs-accuracy tradeoff between floating-point and integer-quantized models. Latency was measured on a single thread using Snapdragon 835 cores (big and LITTLE). Quantized training and inference results in up to a 50% reduction in running time, with a minimal loss in accuracy (−1.8% relative).

DM     Type     mAP    LITTLE (ms)   big (ms)
100%   floats   22.1   778           370
       8 bits   21.7   687           272
50%    floats   16.7   270           121
       8 bits   16.6   146           61

Table 4.4: Object detection speed and accuracy on COCO dataset of floating point and integer-only quantized models. Latency (ms) is measured on Qualcomm Snapdragon 835 big and LITTLE cores.

4.2.3 Face detection

To better examine quantized MobileNet SSD on a smaller scale, we benchmarked face detection on the face attribute classification dataset (a Flickr-based dataset used in [10]). We contacted the authors of [10] to evaluate our quantized MobileNets on detection and face attributes following the same protocols (detailed in Appendix D.4).

As indicated by tables 4.5 and 4.6, quantization provides close to a 2× latency reduction with a Qualcomm Snapdragon 835 big or LITTLE core at the cost of a ∼2% drop in average precision. Since quantized training results in little accuracy degradation, we see an improved tradeoff even though the Qualcomm Snapdragon 821 is highly optimized for floating point arithmetic (see Figure 4.2 for comparison).

DM     Type     Precision   Recall
100%   floats   68%         76%
       8 bits   66%         75%
50%    floats   65%         70%
       8 bits   62%         70%
25%    floats   56%         64%
       8 bits   54%         63%

Table 4.5: Face detection accuracy of floating point and integer-only quantized models. The reported precision / recall is averaged over different precision / recall values where an IOU of x between the groundtruth and predicted windows is considered a correct detection, for x in {0.5, 0.55, . . . , 0.95}.

DM     Type     LITTLE cores          big cores
                1      2      4       1      2      4
100%   floats   711    –      –       337    –      –
       8 bits   372    238    167     154    100    69
50%    floats   233    –      –       106    –      –
       8 bits   134    96     74      56     40     30
25%    floats   100    –      –       44     –      –
       8 bits   67     52     43      28     22     18

Table 4.6: Face detection: latency of floating point and quantized models on Qualcomm Snapdragon 835 cores.
Figure C.1: Simple graph: original. [Graph nodes: input, weights → conv → + biases → ReLU6 → output.]

Figure C.2: Simple graph: quantized. [Graph nodes: input, weights → wt quant → conv → + biases → ReLU6 → act quant → output.]

Figure C.3: Layer with a bypass connection: original.

D.2. Inception protocol

All results in table 4.3 were obtained after training for approximately 10 million steps, with batches of 32 samples, using 50 distributed workers, asynchronously. Training data were ImageNet 2012 299 × 299 images with labels. Image augmentation consisted of: random crops, random horizontal flips, and random color distortion. The optimizer used was RMSProp with learning rate starting at 0.045 and decaying exponentially and stepwise with factor 0.94 after every 2 epochs. Other RMSProp parameters were: 0.9 momentum, 0.9 decay, 1.0 epsilon term. Trained parameters were EMA averaged with decay 0.9999.

D.3. COCO detection protocol

Preprocessing. During training, all images are randomly cropped and resized to 320 × 320. During evaluation, all images are directly resized to 320 × 320. All input values are normalized to [−1, 1].

Optimization. We used the RMSprop optimizer from TensorFlow [1] with a batch size of 32. The learning rate starts from 4 × 10⁻³ and decays in a staircase fashion by a factor of 0.1 for every 100 epochs. Activation quantization is delayed for 500,000 steps for reasons discussed in section 3. Training uses 20 workers asynchronously, and stops after validation accuracy plateaus, normally after approximately 6 million steps. (A sketch of this schedule is given after D.4 below.)

Metrics. Evaluation results are reported with the COCO primary challenge metric: AP at IoU=.50:.05:.95. We follow the same train/eval split in [13].

D.4. Face detection and face attribute classification protocol

Preprocessing. Random 1:1 crops are taken from images in the Flickr-based dataset used in [10] and resized to 320 × 320 pixels for face detection and 128 × 128 pixels for face attribute classification. The resulting crops are flipped horizontally with a 50% probability. The values for each of the RGB channels are renormalized to be in the range [−1, 1].
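As a worked example of the optimization settings in the COCO detection protocol (D.3) above, the following TF 1.x-style sketch sets up the staircase learning-rate decay and the RMSprop optimizer. The steps_per_epoch value is a placeholder that depends on the training-set size and batch size; the RMSProp momentum/decay/epsilon values are not specified in D.3, and the ones shown mirror the Inception protocol (D.2) as an assumption.

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
steps_per_epoch = 3700  # placeholder: (number of training images) / (batch size of 32)

# Learning rate starts at 4e-3 and decays by a factor of 0.1 every 100 epochs,
# in a staircase fashion.
learning_rate = tf.train.exponential_decay(
    learning_rate=4e-3,
    global_step=global_step,
    decay_steps=100 * steps_per_epoch,
    decay_rate=0.1,
    staircase=True)

# RMSprop optimizer; momentum/decay/epsilon mirror D.2 and are assumptions here.
optimizer = tf.train.RMSPropOptimizer(
    learning_rate, decay=0.9, momentum=0.9, epsilon=1.0)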
Figure C.7: Convolutional layer with batch normalization: training graph, folded. [Graph nodes include input, weights, conv, moments, γ, β, moving averages MA µ, σ, the folded weight wγ/σ, the folded bias β − γµ/σ, ReLU6, and output.]