Binary Neural Networks
Deep learning has achieved impressive results in image classification, computer vision, and natural language processing. To achieve better performance, deeper and wider networks have been designed, which increases the demand for computational resources. The number of floating-point operations (FLOPs) has increased dramatically with larger networks, and this has become an obstacle to deploying convolutional neural networks (CNNs) on mobile and embedded devices. In this context, Binary Neural Networks: Algorithms, Architectures, and Applications will focus on CNN compression and acceleration, which are important topics for the research community. We will describe numerous methods, including parameter quantization, network pruning, low-rank decomposition, and knowledge distillation. More recently, to reduce the burden of handcrafted architecture design, neural architecture search (NAS) has been used to build neural networks automatically, from binary to low-bit, by searching over a vast architecture space. Our book will also introduce NAS and binary NAS, along with their superiority and state-of-the-art performance in various applications, such as image classification and object detection. We also describe extensive applications of compressed deep models in image classification, speech recognition, object detection, and tracking. These topics can help researchers better understand the usefulness and potential of network compression in practical applications. Interested readers should have basic knowledge of machine learning and deep learning to better understand the methods described in this book.
Key Features
• Reviews recent advances in CNN compression and acceleration
• Elaborates recent advances on binary neural network (BNN) technologies
• Introduces applications of BNN in image classification, speech recognition, object detec-
tion, and more
Multimedia Computing, Communication and Intelligence
Series Editor
Chang Wen Chen & Shiguo Lian
PUBLISHED
Effective Surveillance for Homeland Security:
Balancing Technology and Social Issues
By Francesco Flammini, Roberto Setola, and Giorgio Franceschetti
ISBN: 9781138199705
TV Content Analysis:
Techniques and Applications
By Yiannis Kompatsiaris, Bernard Merialdo, and Shiguo Lian
ISBN: 9780367900946
© 2024 Baochang Zhang, Sheng Xu, Mingbao Lin, Tiancheng Wang, David Doermann
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright
holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowl-
edged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003376132
Typeset in CMR10
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Dedication
To all our collaborators working
on binary neural networks
Taylor & Francis
Taylor & Francis Group
http://taylorandfrancis.com
Contents
1 Introduction 1
1.1 Principal Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Early Binary Neural Networks . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Gradient Approximation . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Structural Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.5 Loss Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.6 Neural Architecture Search . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.7 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.3 Object Detection and Tracking . . . . . . . . . . . . . . . . . . . . . 13
1.2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Our Works on BNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3 CP-NAS: Child-Parent Neural Architecture Search for 1-bit CNNs . . . . . 98
4.3.1 Child-Parent Model for Network Binarization . . . . . . . . . . . . . 100
4.3.2 Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3.3 Search Strategy for CP-NAS . . . . . . . . . . . . . . . . . . . . . . 103
4.3.4 Optimization of the 1-Bit CNNs . . . . . . . . . . . . . . . . . . . . 103
4.3.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.4 DCP-NAS: Discrepant Child-Parent Neural Architecture Search for 1-Bit
CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.2 Redefine Child-Parent Framework for Network Binarization . . . . . 107
4.4.3 Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.4.4 Tangent Propagation for DCP-NAS . . . . . . . . . . . . . . . . . . 109
4.4.5 Generalized Gauss-Newton Matrix (GGN) for Hessian Matrix . . . . 110
4.4.6 Decoupled Optimization for Training the DCP-NAS . . . . . . . . . 111
4.4.7 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Bibliography 179
Index 203
About the Authors
Baochang Zhang is a full professor with the Institute of Artificial Intelligence, Beihang
University, Beijing, China; and also with Zhongguancun Laboratory, Beijing, China. He
was selected for the Program for New Century Excellent Talents in University of the Ministry of Education of China, chosen as the Academic Advisor of the Deep Learning Lab of Baidu Inc., and honored as a Distinguished Researcher of Beihang Hangzhou Institute
in Zhejiang Province. His research interests include explainable deep learning, computer
vision, and pattern recognition. His HGPP and LDP methods were state-of-the-art feature descriptors, with 1234 and 768 Google Scholar citations, respectively, and both are “Test-of-Time” works. His team’s 1-bit methods achieved the best performance on ImageNet. His
group also won the ECCV 2020 Tiny Object Detection, COCO Object Detection, and ICPR
2020 Pollen recognition challenges.
Mingbao Lin finished his MS-PhD study and obtained a PhD in intelligence science and
technology from Xiamen University, Xiamen, China in 2022. In 2016, he received a BS
from Fuzhou University, Fuzhou, China. He is currently a senior researcher with the Tencent Youtu Lab, Shanghai, China. He has published in top-tier conferences and journals, including IEEE TPAMI, IJCV, IEEE TIP, IEEE TNNLS, CVPR, NeurIPS, AAAI, IJCAI, ACM MM, and more. His current research interests include developing efficient vision models, as well as information retrieval.
Recently, we have witnessed a trend in deep learning in which models are rapidly increasing
in complexity [84, 211, 220, 90, 205, 286]. However, the host hardware where the models
are deployed has yet to keep up performance-wise due to practical limitations such as
latency, battery life, and temperature. This results in a large, ever-increasing gap between computational demands and available resources. To address this issue, network quantization [48, 199, 115, 149], which maps single-precision floating-point weights or activations to lower-bit integers for compression and acceleration, has attracted considerable research attention.
The binary neural network (BNN) is the simplest version of low-bit networks and has gained
much attention due to its highly compressed parameters and activation features [48]. The
artificial intelligence company Xnor.ai is the best-known company focusing on BNNs. The company, founded in 2016, raised substantial funding to build tools that help AI algorithms run on devices rather than in remote data centers. Apple Inc. acquired the company and planned to apply BNN technology to its devices to keep user information more private and to speed up processing.
This chapter reviews recent advances in BNN technologies well suited for front-end,
edge-based computing. We introduce and summarize existing works by classifying them
based on gradient approximation, quantization, architecture, loss functions, optimization
method, and binary neural architecture search. We also introduce computer vision and
speech recognition applications and discuss future applications of BNNs.
Deep learning has become increasingly important because of its superior performance.
Still, it suffers from a large memory footprint and high computational cost, making it dif-
ficult to deploy on front-end devices. For example, in unmanned systems, UAVs serve as
computing terminals with limited memory and computing resources, making it difficult
to perform real-time data processing based on convolutional neural networks (CNNs). To
improve storage and computation efficiency, BNNs have shown promise for practical ap-
plications. BNNs are neural networks where the weights are binarized. 1-bit CNNs are a
highly compressed version of BNNs that binarize both the weights and the activations to
decrease the model size and computational cost. These highly compressed models make
them suitable for front-end computing. In addition to these two, other quantizing neural
networks, such as pruning and sparse neural networks, are widely used in edge computing.
This chapter reviews the main advances of BNNs and 1-bit CNNs. Although binarization
operations can make neural networks more efficient, they almost always cause a significant
performance drop. In the last five years, many methods have been introduced to improve
the performance of BNNs. To better review these methods, we describe six aspects: gradient
approximation, quantization, structural design, loss design, optimization, and binary neural
architecture search. Finally, we will also review the object detection, object tracking, and
audio analysis applications of BNNs.
TABLE 1.1
Results reported in BinaryConnect [48] and BinaryNet [99].
Method MNIST CIFAR-10
BinaryConnect (only binary weights) 1.29±0.08% 9.90%
BinaryNet (binary both weights and activations) 1.40% 10.15%
where ωb is the binarized weight and ω the real-valued weight. A second way is to binarize stochastically:
$$\omega_b = \begin{cases} +1, & \text{with probability } p = \sigma(\omega), \\ -1, & \text{with probability } 1-p, \end{cases} \qquad (1.2)$$
where σ is the “hard sigmoid” function. The training process for these networks is slightly
different from full-precision neural networks. The forward propagation utilizes the binarized
weights instead of the full-precision weights, but the backward propagation is the same as
conventional methods. The gradient $\frac{\partial C}{\partial \omega_b}$ needs to be calculated (C is the cost function) and then combined with the learning rate to update the full-precision weights directly.
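The following PyTorch sketch illustrates this training scheme under stated assumptions: the deterministic rule takes the sign of the latent weight, the stochastic rule follows Eq. (1.2) with a hard sigmoid, and a straight-through trick routes the gradient of the binarized weight back to the full-precision copy. All names and hyperparameters here are illustrative, not BinaryConnect's reference implementation.

```python
import torch

def binarize_deterministic(w):
    # omega_b = +1 if omega >= 0, -1 otherwise
    return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

def binarize_stochastic(w):
    # Eq. (1.2): omega_b = +1 with probability p = hard_sigmoid(omega), -1 otherwise
    p = torch.clamp((w + 1.0) / 2.0, 0.0, 1.0)
    return torch.where(torch.rand_like(w) < p, torch.ones_like(w), -torch.ones_like(w))

# Schematic training step: forward with binary weights, update full-precision weights.
w = torch.randn(256, 128, requires_grad=True)       # latent full-precision weights
x = torch.randn(32, 128)
w_b = w + (binarize_deterministic(w) - w).detach()  # forward uses w_b, gradient flows to w
loss = (x @ w_b.t()).pow(2).mean()                  # some cost C
loss.backward()
with torch.no_grad():
    w -= 1e-3 * w.grad                              # update the full-precision weights
    w.grad.zero_()
```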
BinaryConnect only binarizes the weights, while BinaryNet [99] quantizes both the
weights and activations. BinaryNet also introduces two ways to constrain weights and ac-
tivations to be either +1 or −1, like BinaryConnect. BinaryNet also makes several changes
to adapt to binary activations. The first is shift-based Batch Normalization (SBN), which
avoids additional multiplications. The second is shift-based AdaMax instead of the ADAM
learning rule, which also decreases the number of multiplications. The third change concerns the handling of the input to the first layer. BinaryNet treats the continuous-valued inputs of the first layer as fixed-point numbers with m bits of precision. Training neural networks with extremely low-bit weights and activations was proposed as QNN [100]. As we are pri-
marily reviewing work on binary networks, the details of QNN are omitted here. The error
rates of these networks on representative datasets are shown in Table 1.1. However, these
two networks perform unsatisfactorily on larger datasets since weights constrained to +1 and −1 cannot be learned effectively. New methods for training BNNs and 1-bit networks therefore need to be developed.
Wang et al. [234] proposed Binarized Deep Neural Networks (BDNNs) for image clas-
sification tasks, where all the values and operations in the network are binarized. While
BinaryNet deals with CNNs, BDNNs target basic artificial neural networks consisting of fully connected layers. Bitwise neural networks [117] also present a completely bitwise network in which all participating variables are bipolar binaries.
$$g_r = g_q 1_{|r| \le 1}, \qquad (1.3)$$
where $1_{|r| \le 1}$ equals 1 when $|r| \le 1$ and 0 otherwise. It can also be seen as propagating the gradient through hard tanh, which is a piecewise-linear activation function.
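In code, this estimator is commonly written as a custom autograd function; the sketch below (the class name is ours) returns the sign in the forward pass and clips the incoming gradient to the region $|r| \le 1$ in the backward pass, matching Eq. (1.3).

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign(r) forward; g_r = g_q * 1_{|r| <= 1} backward (Eq. 1.3)."""

    @staticmethod
    def forward(ctx, r):
        ctx.save_for_backward(r)
        # Note: torch.sign maps 0 to 0; a practical BNN maps 0 to +1 instead.
        return torch.sign(r)

    @staticmethod
    def backward(ctx, g_q):
        (r,) = ctx.saved_tensors
        return g_q * (r.abs() <= 1).to(g_q.dtype)

r = torch.randn(4, requires_grad=True)
SignSTE.apply(r).sum().backward()   # r.grad is 1 inside [-1, 1] and 0 outside
```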
The Bi-real Net [159] approximates the derivative of the sign function for activations.
Unlike using Htanh [99] to approximate the sign function, the Bi-real Net uses a piecewise
polynomial function for a better approximation.
Bi-real Net also proposes a magnitude-aware gradient for weights. When training BNNs, the gradient $\frac{\partial C}{\partial W}$ is related only to the sign of the weights and is independent of their magnitude. Bi-real Net therefore replaces the sign function with a magnitude-aware function.
Xu et al. [266] use a higher-order approximation for weight binarization. They propose
a long-tailed approximation for activation binarization as a trade-off between tight approx-
imation and smooth backpropagation.
Differentiable Soft Quantization (DSQ) [74] also introduces a function to approximate
the standard binary and uniform quantization process called differentiable soft quantization.
DSQ employs hyperbolic tangent functions to gradually approach the staircase function for
low-bit quantization (sign function in 1-bit CNN). The binary DSQ function is as follows:
$$Q_s(x) = \begin{cases} -1, & x < -1, \\ 1, & x > 1, \\ s\tanh(kx), & \text{otherwise}, \end{cases} \qquad (1.4)$$
with
$$k = \frac{1}{2}\log\!\left(\frac{2}{\alpha} - 1\right), \qquad s = \frac{1}{1-\alpha}. \qquad (1.5)$$
Especially when α is small, DSQ can closely approximate the uniform quantization
performance. This means that a suitable α will allow DSQ to help train a quantized model
with higher accuracy. Note that DSQ is differentiable, and thus the derivative of this function
can be used while updating the parameters directly.
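A minimal sketch of the binary DSQ function of Eqs. (1.4)–(1.5); the value of α is an illustrative choice. Because s·tanh(kx) equals exactly ±1 at x = ±1, a clamp reproduces the piecewise definition while keeping the inner region differentiable.

```python
import math
import torch

def dsq_binary(x: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Binary DSQ (Eqs. 1.4-1.5): s * tanh(k * x) inside [-1, 1], hard +/-1 outside."""
    k = 0.5 * math.log(2.0 / alpha - 1.0)
    s = 1.0 / (1.0 - alpha)
    return torch.clamp(s * torch.tanh(k * x), -1.0, 1.0)
```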
According to the above methods, we can summarize that they all introduce a different
function to approximate the sign function in BinaryConnect so that the gradient to full-
precision weights or activations can be obtained more accurately. Therefore, the BNN or 1-bit network converges more easily during training, and the network performance improves.
1.1.3 Quantization
BinaryConnect and BinaryNet use simple quantization methods. After the full-precision
weights are updated, the new binary weights are generated by taking the sign of real-value
weights. But when the binary weights are decided only by the sign of full-precision weights,
this may cause significant errors in quantization. Before introducing new methods to improve
the quantization process, we highlight the notations used in XNOR-Net [199] that will be
used in our discussions. For each layer in a CNN, I is the input, W is the weight filter, B is
the binarized weight (±1), and H is the binarized input.
Rastegari et al. [199] propose Binary-Weight-Networks (BWN) and XNOR-Networks.
BWN approximates the weights with binary values, a variation of a BNN. XNOR-Networks binarize both the weights and the activations and are considered 1-bit networks. Both net-
works use the idea of a scaling factor. In BWN, the real-valued weight filter W is estimated
using a binary filter B and a scaling factor α. The convolutional operation is then approxi-
mated by:
I ∗ W ≈ (I ⊕ B)α, (1.6)
where ⊕ indicates a convolution without multiplication. By introducing the scaling factor,
binary weight filters reduce memory usage by a factor of 32× compared to single precision
filters. To ensure W is approximately equal to αB, BWN defines an optimization problem,
and the optimal solution is:
$$B^* = \mathrm{sign}(W), \qquad (1.7)$$
$$\alpha^* = \frac{W^{\top}\mathrm{sign}(W)}{n} = \frac{\sum_i |W_i|}{n} = \frac{1}{n}\|W\|_{\ell 1}. \qquad (1.8)$$
Therefore, the optimal estimate of a binary weight filter can be achieved by taking the sign of the weight values. The optimal scaling factor is the average of the absolute weight values. The scaling factor is also used to calculate the gradient in backpropagation.
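A sketch of the BWN estimate for a 4-D convolutional weight tensor, assuming one scaling factor per output filter (the per-filter granularity is an assumption of this example, following common practice):

```python
import torch

def bwn_binarize(W: torch.Tensor):
    """W ~= alpha * B with B = sign(W) and alpha = ||W||_l1 / n (Eqs. 1.7-1.8)."""
    B = torch.sign(W)
    n = W[0].numel()                            # weights per filter
    alpha = W.abs().flatten(1).sum(dim=1) / n   # one alpha per output filter
    return B, alpha

W = torch.randn(64, 32, 3, 3)                   # (out_channels, in_channels, kH, kW)
B, alpha = bwn_binarize(W)
W_approx = alpha.view(-1, 1, 1, 1) * B          # used in place of W in the convolution
```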
The core idea of XNOR-Net is the same as BWN, but another scaling factor, β, is used
when binarizing the input I into H. As the experiments show, this approach outperforms
BinaryConnect and BNN by a large margin on ImageNet. Unlike XNOR-Net, which uses the mean of the absolute weights as the scaling factor, Xu et al. [266] define a trainable scaling factor for both weights and activations. LQ-Nets [284] quantize both weights and activations
with arbitrary bit-widths, including 1-bit. The learnability of the quantizers makes them
compatible with bitwise operations to keep the fast inference merit of properly quantized
neural networks (QNNs).
Based on XNOR-Net [199], the High-Order Residual Quantization (HORQ) [138] pro-
vides a high-order binarization scheme, which achieves a more accurate approximation while
still having the advantage of binary operations. HORQ calculates the residual error and then
performs a new round of thresholding operations to approximate the residual further. This
binary approximation of the residual can be considered a higher-order binary input. Follow-
ing XNOR, HORQ defines the first-order residual tensor R1 (x) by computing the difference
between the real input and the first-order binary quantization:
R1 (x) = X − β1 H1 ≈ β2 H2 , (1.9)
where R1 (x) is a real value tensor. By this analogy, R2 (x) can be seen as the second-order
residual tensor, and β3 H3 also approximates it. After recursively performing the above
operations, they obtain order-K residual quantization:
$$X = \sum_{i=1}^{K} \beta_i H_i. \qquad (1.10)$$
During the training of the HORQ network, the input tensor can be reshaped into a matrix and expressed as a residual quantization of any order. Experiments show that HORQ-Net outperforms XNOR-Net in accuracy on the CIFAR dataset.
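A sketch of order-K residual binarization in the spirit of Eqs. (1.9)–(1.10); the scale of each round is taken as the mean absolute value of the current residual, which mirrors the XNOR-style scaling but is an assumption of this example rather than HORQ's exact estimator.

```python
import torch

def residual_binarize(X: torch.Tensor, order: int = 2):
    """X ~= sum_i beta_i * H_i: binarize, subtract, and binarize the residual again."""
    residual = X
    betas, Hs = [], []
    for _ in range(order):
        H = torch.sign(residual)
        beta = residual.abs().mean()     # scalar scale for this round
        betas.append(beta)
        Hs.append(H)
        residual = residual - beta * H   # the next round approximates this residual
    return betas, Hs

X = torch.randn(1, 64, 14, 14)
betas, Hs = residual_binarize(X, order=2)
X_approx = sum(b * H for b, H in zip(betas, Hs))
```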
ABC-Net [147] is another network designed to improve the performance of binary net-
works. ABC-Net approximates the full precision weight filter W with a linear combination
of M binary filters B1, B2, ..., BM ∈ {+1, −1} such that W ≈ α1B1 + ... + αMBM. These
binary filters are fixed as follows:
where W̄ and std(W) are the mean and standard deviation of W, respectively. For activations, ABC-Net employs multiple binary activations to alleviate information loss. Like the binarized weights, the real activation I is estimated using a linear combination of N binary activations A1, A2, ..., AN such that I = β1A1 + ... + βNAN, where
FIGURE 1.1
Modulation process based on an M-Filter.
(e.g., sign function) for the reconstructed weights. While the reconstruction is binarized,
the computation in the latent factorized space is done in the real domain. This has several
advantages. First, the latent factorization enforces a coupling of filters before binarization,
which significantly improves the accuracy of trained models. Second, during training, the
binary weights of each convolutional layer are parametrized using a real-valued matrix or
tensor decomposition, while during inference, reconstructed (binary) weights are used.
Instead of using the same binary method for weights and activations, Huang et al. [93]
believe that the best performance for binarized neural networks can be obtained by applying
different quantization methods to weights and activations. They simultaneously binarize the
weights and quantize the activations to reduce bandwidth.
ReActNet [158] proposes a simple channel-wise reshaping and shifting operation for the
activation distribution, which replaces the sign function with ReAct-Sign, and replaces the
PReLU function with ReAct-PReLU. The parameters in ReAct-Sign and ReAct-PReLU
can be updated.
Compared to XNOR-Net [199], both HORQ-Net [138] and ABC-Net [147] use mul-
tiple binary weights and activations. As a result, HORQ-Net and ABC-Net outperform
XNOR-Net on binary tasks, but they also increase complexity, which goes against the ini-
tial intention of BNNs. New neural networks that perform better while retaining the advantage of speed remain to be explored. MCN [236] and LBCNN [109] proposed new filters
while quantizing parameters and introducing a new loss function to learn these auxiliary
filters.
FIGURE 1.2
MCNs convolution.
FIGURE 1.3
A block in XNOR-Net.
XNOR-Net [199] changes the block structure in a typical CNN. A typical block in a
CNN contains different layers: 1-Convolutional, 2-BatchNorm, 3-Activation, and 4-Pooling.
To further decrease information loss due to binarization, XNOR-Net normalizes the input
before binarization. This ensures the data have zero mean, so thresholding at zero minimizes
quantization error. The order of the layers in XNOR-Net is shown in Fig. 1.3.
The Bi-real Net [159] attributes the poor performance of 1-bit CNNs to their low rep-
resentation capacity. The representation capacity is defined as the number of all possible
configurations of x, where x could be a scalar, vector, matrix, or tensor. Bi-real Net pro-
poses a simple shortcut to preserve real activations before the sign function to increase
the representation capability of the 1-bit CNN. As shown in Fig. 1.4, the block indicates
the structure “Sign → 1-bit convolution → batch normalization → addition operator.” The
shortcut connects the input activations of the sign function in the current block to the output activations after the batch normalization in the same block. These two activations are added through an addition operator, and the combined activations are then passed to the sign function in the next block.
The simple identity shortcut significantly enhances the representation capability of each
block in the 1-bit CNN. The only additional cost of computation is the addition operation
of two real activations without additional memory cost.
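A sketch of such a block follows; the weight binarization inside the 1-bit convolution is omitted for brevity, and the straight-through sign from earlier is reused for the activations. This illustrates the connection pattern, not Bi-real Net's reference code.

```python
import torch
import torch.nn as nn

class SignSTEFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, g):
        (x,) = ctx.saved_tensors
        return g * (x.abs() <= 1).to(g.dtype)

class BiRealBlock(nn.Module):
    """Sign -> 1-bit convolution -> batch normalization -> add identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = SignSTEFn.apply(x)       # binarize the activations
        out = self.bn(self.conv(out))
        return out + x                 # shortcut keeps real-valued activations flowing
```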
BinaryDenseNet [12] designs a new BNN architecture that addresses the main drawbacks
of BNNs. DenseNets [92] apply shortcut connections so that new information gained in one
layer can be reused throughout the depth of the network. This is a significant characteristic
that helps to maintain the information flow. The bottleneck design in DenseNets signifi-
cantly reduces the filters and values between layers, resulting in less information flow in the
BNNs. These bottlenecks must be eliminated. Due to the limited representation capacity
of binary layers, the DenseNet architecture does not perform satisfactorily. This problem is
solved by increasing the growth rate or using a larger number of blocks. To keep the number
FIGURE 1.4
1-bit CNN with shortcut.
8 Introduction
FIGURE 1.5
BinaryDenseNet.
of parameters equal for a given BinaryDenseNet, they halve the growth rate and double the
number of blocks simultaneously. The architecture of BinaryDenseNet is shown in Fig. 1.5.
MeliusNet [10] presents a new architecture that alternates a DenseBlock, which increases the feature capacity, with an ImprovementBlock, which increases the quality of the features. With this method, 1-bit CNNs can match the popular compact network MobileNet-v1 in terms of model size, number of operations, and accuracy.
The building blocks of MeliusNet are shown in Fig. 1.6.
Group-Net [303] also improves the performance of 1-bit CNNs through structural design.
Inspired by a fixed number of binary digits representing a floating point number in a com-
puter, Group-Net proposes decomposing a network into binary structures while preserving
its representability rather than directly doing the quantization via “value decomposition.”
Bulat et al. [25] are the first to study the effect of neural network binarization on lo-
calization tasks, such as human pose estimation and face alignment. They propose a novel
hierarchical, parallel, and multiscale residual architecture that significantly improves per-
formance over the standard bottleneck block while maintaining the number of parameters,
thus bridging the gap between the original network and its binarized counterpart. The new
architecture increases the size of the receptive field, as well as the gradient flow.
LightNN [57] replaces multiplications with one shift or a constrained number of shifts
and adds, which forms a new kind of model. The experiments show that LightNN has better
accuracy than BNNs, with only a slight increase in energy.
FIGURE 1.6
Building blocks of MeliusNet (c denotes the number of channels in the feature map).
In this section, we list several works that modify the structure of BNNs, contributing to
better performance or convergence of the network. XNOR-Net and Bi-real Net make minor
adjustments to the original networks, while MCN proposes new filters and convolutional
operations. The loss function is also adjusted according to the new filters, which will be
introduced in Section 1.1.5.
L = LM + LS . (1.13)
where C is the full precision weights, Ĉ is the binarized weights, M is the M-Filters defined
in Section 1.1.4, fm denotes the feature map of the last convolutional layer for the mth
sample, and f¯ denotes the class-specific mean feature map of previous samples. The first
entry of LM represents the filter loss, while the second entry calculates the center loss using
a conventional loss function, such as the softmax loss.
PCNNs [77] propose a projection loss for discrete backpropagation. It is the first to
define the quantization of the input variable as a projection onto a set to obtain a projec-
tion loss. Our BONNs [287] propose a Bayesian-optimized 1-bit CNN model to improve the
performance of 1-bit CNNs significantly. BONNs incorporate the prior distributions of full-
precision kernels, features, and filters into a Bayesian framework to construct 1-bit CNNs
comprehensively, end-to-end. They denote the quantization error as y and the full-precision
weights as x. They maximize p(x|y) to optimize x for quantization to minimize the recon-
structed error. This optimization problem can be converted to a maximum a posteriori since
the distribution of x is known. For feature quantization, the method is the same. Therefore,
the Bayesian loss is as follows:
$$
\begin{aligned}
L_B = \frac{\lambda}{2} \sum_{l=1}^{L} \sum_{i=1}^{C_l^o} \sum_{n=1}^{C_l^i} \Big\{ & \big\|\hat{k}_n^{l,i} - w^l \circ k_n^{l,i}\big\|_2^2 \\
& + v\,(k_{n+}^{l,i} - \mu_{i+}^{l})^{\top} (\Psi_{i+}^{l})^{-1} (k_{n+}^{l,i} - \mu_{i+}^{l}) \\
& + v\,(k_{n-}^{l,i} - \mu_{i-}^{l})^{\top} (\Psi_{i-}^{l})^{-1} (k_{n-}^{l,i} - \mu_{i-}^{l}) + v \log\!\big(\det(\Psi^{l})\big) \Big\} \\
+ \frac{\theta}{2} \sum_{m=1}^{M} \Big\{ & \|f_m - c_m\|_2^2 + \sum_{n=1}^{N_f} \big[ \sigma_{m,n}^{-2} (f_{m,n} - c_{m,n})^2 + \log(\sigma_{m,n}^2) \big] \Big\},
\end{aligned} \qquad (1.15)
$$
where k is the full precision kernels, w is the reconstructed matrix, v is the variance of y,
μ is the mean of the kernels, Ψ is the covariance of the kernels, fm are the features of class
m, and c is the mean of fm .
Zheng et al. [288] define a new quantization loss between binary weights and learned real
values, where they theoretically prove the necessity of minimizing the weight quantization
loss. Ding et al. [56] propose using distribution loss to explicitly regularize the activation
flow and develop a framework to formulate the loss systematically. Empirical results show that the proposed distribution loss is robust to the choice of training hyperparameters.
Reviewing these methods, they all aim to minimize the error and information loss of
quantization, which improves the compactness and capacity of 1-bit CNNs.
1.1.7 Optimization
Researchers also explore new training methods to improve BNN performance. These meth-
ods are designed to handle the drawbacks of BNNs. Some borrow popular techniques from
other fields and integrate them into BNNs, while others make changes based on classical
BNNs training, such as improving the optimizer.
Sari et al. [234] find that the BatchNorm layer plays a significant role in avoiding explod-
ing gradients, so the standard initialization methods developed for full-precision networks
are irrelevant for BNNs. They also break down BatchNorm components into centering and
scaling, showing only minibatch centering is required. Their work provides valuable infor-
mation for research on the BNN training process. The experiments of Alizadeh et al. [2]
show that most of the tricks commonly used in training binary models, such as gradient
and weight clipping, are only required during the final stages of training to achieve the best
performance.
XNOR-Net++ [26] provides a new training algorithm for 1-bit CNNs based on XNOR-
Net. Compared to XNOR-Net, this new method combines activation and weight scaling
factors into a single scalar learned discriminatively through backpropagation. They also try
different ways to construct the shape of the scale factors on the premise that the computa-
tional budget remains fixed.
Borrowing an idea from the Alternating Direction Method of Multipliers (ADMM),
Leng et al. [128] decouple the continuous parameters from the discrete constraints of the
network and divide the original hard problem into several subproblems. These subproblems
are solved by extra gradient and iterative quantization algorithms, leading to considerably
faster convergence than conventional optimization methods.
Deterministic Binary Filters (DBFs) [225] learn weighted coefficients of predefined or-
thogonal binary bases instead of the conventional approach, which directly learns the con-
volutional filters. The filters are generated as a linear combination of orthogonal binary
codes and thus can be generated very efficiently in real time.
BWNH [91] trains binary weight networks by hashing. They first reveal the strong
connection between inner-product preserving hashing and binary weight networks, showing
that training binary weight networks can be intrinsically regarded as a hashing problem.
They propose an alternating optimization method to learn the hash codes instead of directly
learning binary weights.
CI-BCNN [239] learns BNNs with channel-wise interactions for efficient inference. Un-
like existing methods that directly apply XNOR and BITCOUNT operations, this method
learns interacted bitcount according to the mined channel-wise interactions. The incon-
sistent signs in binary feature maps are corrected based on prior knowledge provided by
channel-wise interactions so that the information of the input images is preserved in the
forward propagation of BNNs. Specifically, they employ a reinforcement learning model to
learn a directed acyclic graph for each convolutional layer, representing implicit channel-wise
interactions. They obtain the interacted bitcount by adjusting the output of the original
bitcount in line with the effects exerted by the graph. They train the BCNN and the graph
structure simultaneously.
BinaryRelax [272] is a two-phase algorithm to train CNNs with quantized weights, in-
cluding binary weights. They relax the hard constraint into a continuous regularizer via
Moreau envelope [176], the squared Euclidean distance to the set of quantized weights.
They gradually increase the regularization parameter to close the gap between the weights
and the quantized state. In the second phase, they introduce the exact quantization scheme
with a small learning rate to guarantee fully quantized weights.
CBCNs [149] propose new circulant filters (CiFs) and a circulant binary convolution
(CBConv) to enhance the capacity of binarized convolutional features through circulant
backpropagation. A CiF is a 4D tensor of size K × K × H × H, generated based on a
learned filter and a circulant transfer matrix M . The matrix M here rotates the learned
filter at different angles. The original 2D H × H learned filter is replicated three times and the copies are concatenated to obtain the 4D CiF, as shown in Fig. 1.7. The method
can improve the representation capacity of BNNs without changing the model size.
Rectified binary convolutional networks (RBCNs) [148] use a generative adversarial net-
work (GAN) to train the 1-bit binary network with the guidance of its corresponding full-
precision model, which significantly improves the performance of 1-bit CNNs. The rectified
convolutional layers are generic and flexible and can be easily incorporated into existing
DCNNs such as WideResNets and ResNets.
FIGURE 1.7
The generation of CiF.
Martinez et al. [168] attempt to minimize the discrepancy between the binary output and
the corresponding real-valued convolution. They proposed real-to-binary attention matching
suited for training 1-bit CNNs. They also devised an approach in which the architectural gap
between real and binary networks is progressively bridged through a sequence of teacher-
student pairs.
Instead of using a pre-trained full-precision model, Bethge et al. [11] directly train a
binary network from scratch, which does not benefit from other standard methods. Their
implementation is based on the BMXNet framework [268].
Helwegen et al. [85] argue that real-valued latent weights cannot be treated analogously to weights in real-valued networks; rather, their primary role is to provide inertia during training. They introduced the Binary Optimizer (Bop), the first optimizer designed
for BNNs.
BinaryDuo [115] proposes a new training scheme for binary activation networks in which
two binary activations are coupled into a ternary activation during training. They first
decouple a ternary activation into two binary activations. Then the number of weights is
doubled after decoupling. They reduce the coupled ternary model to match the parameter
size of the decoupled model and the baseline model. They update each weight independently
to find a better value since the two weights no longer need to share the same value.
BENN [301] uses classical ensemble methods to improve the performance of 1-bit CNNs.
While ensemble techniques have been broadly believed to be only marginally helpful for strong classifiers such as deep neural networks, their analysis and experiments show that they are naturally a perfect fit to boost BNNs. The main uses of the ensemble strategies
are shown in [19, 32, 184].
TentacleNet [173] is also inspired by the theory of ensemble learning. Compared to
BENN [301], TentacleNet takes a step forward, showing that binary ensembles can reach
high accuracy with fewer resources.
BayesBiNN [170] uses a distribution over the binary variable, resulting in a principled
approach to discrete optimization. They used a Bernoulli approximation to the posterior
and estimated it using the Bayesian learning rule proposed in [112].
1.2 Applications
The success of BNNs makes it possible to apply deep learning models to edge computing.
Neural network models have been used in various real tasks with the help of these binary methods, including image classification, speech recognition, and object detection and tracking.
TABLE 1.2
Experimental results of some famous binary methods on ImageNet.
Methods | Weights | Activations | Model | Binarized Acc. (Top-1 / Top-5) | Full-Precision Acc. (Top-1 / Top-5)
Bi-Real Net [159] | Binary | Binary | ResNet-34 | 62.2 / 83.9 | 73.3 / 91.3
Liu et al. [148] experiment on object tracking after proposing RBCNs. They used the SiamFC network as the backbone for object tracking and binarized SiamFC as the Rectified Binary Convolutional SiamFC Network (RB-SF). They evaluated RB-SF on four datasets, GOT-10K [94], OTB50 [250], OTB100 [251], and UAV123 [177], using average overlap (AO) and success rate (SR). The results are shown in Table 1.3.
Yang et al. [269] propose a new method to optimize a YOLO-based object tracking network by simultaneously using approximate weight binarization, a trainable-threshold group binarization activation function, and depthwise separable convolutions, significantly reducing the computational complexity and model size.
1.2.4 Applications
Other applications include face recognition and face alignment. Face recognition: Liu et al. [160] apply the Weight Binarization Cascade Convolution Neural Network to eye localization, a subtask of face recognition. BNNs here help reduce the storage size of the model as well as speed up computation.
Face alignment: Bulat et al. [25] test their method on three challenging datasets for large-pose face alignment: AFLW [121], AFLW-PIFA [108], and AFLW2000-3D [302], reporting state-of-the-art performance in many cases.
FIGURE 1.8
Our research agenda on BNNs.
lem in a theoretical framework. In ReBNNs, the reconstruction loss introduced in MCN can
theoretically decrease the gradient oscillation by changing its balanced factor.
Although the performance of BNNs has improved dramatically in the last three years,
the gap remains large compared to that of their full-precision counterparts. One possible solution could come from neural architecture search (NAS), which has led to state-of-the-art performance in many learning tasks. A natural idea is to introduce NAS into BNNs, leading to our binarized neural architecture search (BNAS) [35]. In our BNAS
framework, we show that the BNNs obtained by BNAS can outperform conventional models
by a large margin. While BNAS only focuses on kernel binarization to achieve 1-bit CNNs,
our CP-NAS [304] advances this work to binarize both weights and activations. In CP-NAS,
a Child-Parent (CP) model is introduced to a differentiable NAS to search the binarized
architecture (child) under the supervision of a full precision model (Parent). Based on CP-
NAS, we achieve much better performance than conventional binarized neural networks.
Our research agenda on BNNs is shown in Fig. 1.8.
2
Quantization of Neural Networks
Quantization is a strategy that has demonstrated outstanding and consistent success in both
the training and inference of neural networks (NN). NN present unique opportunities for
advancement even though the issues of numerical representation and quantization are as old
as digital computing. Although most of this quantization survey is concerned with inference,
it is essential to note that quantization has also been successful in NN training [8, 42, 63,
105]. Innovations in half-precision and mixed-precision training in particular [47, 80] have
enabled greater throughput in AI accelerators. However, going below half-precision without
significant tuning has proven to be challenging, and most recent quantization research has
concentrated on inference.
where $q_i$ represents the discrete quantization levels and $\Delta_i$ denotes the quantization steps. When the value of a real number x falls between the quantization steps $\Delta_i$ and $\Delta_{i+1}$, the quantizer Q projects it to the associated quantization level $q_i$. It should be noted that neither the $q_i$ nor the $\Delta_i$ are evenly spaced.
Nonuniform quantization can achieve higher accuracy for a fixed bit width because
it allows for better capturing of distributions by focusing on important value regions or
determining appropriate dynamic ranges. For example, various nonuniform quantization
techniques have been developed for bell-shaped distributions of weights and activations,
which often exhibit long tails. A commonly employed rule-based nonuniform quantization
method uses a logarithmic distribution, where the quantization steps and levels increase
exponentially rather than linearly.
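As a toy illustration of such a rule-based logarithmic scheme, the sketch below snaps magnitudes to powers of two within a window of $2^b$ levels below the largest magnitude; the level placement and clipping policy are illustrative assumptions, not a specific published quantizer.

```python
import torch

def power_of_two_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Map |x| to the nearest power of two and keep the sign of x.

    The levels (..., 1/8, 1/4, 1/2, 1, ...) are spaced exponentially, giving finer
    resolution near zero, where bell-shaped weight/activation values concentrate.
    """
    sign = torch.sign(x)
    mag = x.abs().clamp(min=1e-8)
    exp = torch.round(torch.log2(mag))
    max_exp = float(torch.log2(mag.max()).ceil())
    # Keep only 2**n_bits exponent values below the maximum magnitude.
    exp = exp.clamp(min=max_exp - (2 ** n_bits - 1), max=max_exp)
    return sign * torch.pow(2.0, exp)
```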
Recent advances have approached it as an optimization problem to enhance quantization
performance. The goal is to minimize the difference between the original tensor and its
quantized counterpart by adjusting the quantization steps/levels in the quantizer qx .
Nonuniform quantization can also be improved by making the quantizer itself trainable.
These methods are called learnable quantizers, and the quantization steps/levels are opti-
mized through an iterative process or gradient descent along with the model parameters.
Overall, nonuniform quantization can better represent data by distributing bits and discretizing the range of parameters unevenly. However, this quantization type can be chal-
lenging to implement effectively on standard computation hardware such as a GPU and
a CPU. As a result, uniform quantization remains the prevalent method because of its
straightforward implementation and efficient mapping to hardware.
2.2.1 Notations
The goal of quantization in deep networks is to reduce the precision of the weights and the
activations during the inference time to increase the computational efficiency. Given the data
to quantize v, the quantizer step size s, and the number of positive and negative quantization levels ($Q_P$ and $Q_N$), a quantizer is used to compute v̄, a quantized integer representation of the data, and v̂, a quantized representation of the data at the same scale as v:
$$\bar{v} = \lfloor \mathrm{clip}(v/s, -Q_N, Q_P) \rceil \qquad (2.8)$$
$$\hat{v} = \bar{v} \times s \qquad (2.9)$$
FIGURE 2.1
Computation of a low-precision convolution or fully connected layer, as envisioned here.
This technique uses low-precision inputs, represented by w̄ and x̄, in matrix multiplication
units for convolutional or fully connected layers in deep learning networks. The low-precision
integer matrix multiplication units can be computed efficiently, and a step size then scales the output with a relatively low-cost high-precision scalar-tensor multiplication. This scaling
step has the potential to be combined with other operations, such as batch normalization,
through algebraic merging, as shown in Fig. 2.1. This approach minimizes the memory and
computational costs associated with matrix multiplication.
FIGURE 2.2
Given s = 1, QN = 0, QP = 3, A) quantizer output and B) gradients of the quantizer
output concerning step size, s, for LSQ, or a related parameter controlling the width of
the quantized domain (equal to s(QP + QN )) for QIL [110] and PACT [43]. The gradient
employed by LSQ is sensitive to the distance between v and each transition point, whereas
the gradient employed by QIL [110] is sensitive only to the distance from quantizer clip
points and the gradient employed by PACT [43] is zero everywhere below the clip point.
Here, we demonstrate that networks trained with the LSQ gradient reach a higher accuracy
than those trained with the QIL or PACT gradients in prior work.
2.2.4 Training
LSQ trains the model quantizers by making the step sizes learnable parameters, with the loss
gradient computed using the quantizer gradient mentioned earlier. In contrast, other model
parameters can be trained with conventional techniques. A common method of training
quantized networks [48] is employed where full precision weights are stored and updated,
while quantized weights and activations are used for forward and backward passes. The
gradient through the quantizer round function is calculated using the straight-through es-
timator [9] so that
$$\frac{\partial \hat{v}}{\partial v} = \begin{cases} 1, & \text{if } -Q_N < v/s < Q_P, \\ 0, & \text{otherwise}, \end{cases} \qquad (2.12)$$
and stochastic gradient descent is used to update parameters.
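A minimal sketch of an LSQ-style quantizer with a learnable step size, combining Eqs. (2.8)–(2.9) with the straight-through estimate of Eq. (2.12). The original method additionally rescales the step-size gradient and initializes s from the data statistics; both refinements are omitted here.

```python
import torch
import torch.nn as nn

class LSQQuantizer(nn.Module):
    def __init__(self, bits: int = 2, signed: bool = True):
        super().__init__()
        if signed:
            self.Qn, self.Qp = 2 ** (bits - 1), 2 ** (bits - 1) - 1
        else:
            self.Qn, self.Qp = 0, 2 ** bits - 1
        self.s = nn.Parameter(torch.tensor(1.0))   # learnable step size

    def forward(self, v):
        v_bar = torch.clamp(v / self.s, -self.Qn, self.Qp)
        v_bar = v_bar + (torch.round(v_bar) - v_bar).detach()   # straight-through round
        return v_bar * self.s                                   # v_hat, back at the scale of v
```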
FIGURE 2.3
The histogram of query values q (shadow) along with the PDF curve of Gaussian distri-
bution N (μ, σ 2 ) [195], for three selected layers in DeiT-T and 4-bit fully quantized DeiT-T
(baseline). μ and σ 2 are the statistical mean and variance of the values.
For ease of training, the input to the matrix multiplication layers is set to v̂, mathe-
matically equivalent to the inference operations described earlier. The input activations and
weights are set to 2, 3, 4, or 8 bits for all matrix multiplication layers except the first and
last, which are always set to 8 bits. This standard practice in quantized networks has been
shown to improve performance significantly. All other parameters are represented using
FP32. Quantized networks are initialized using weights from a trained full-precision model
with a similar architecture before being fine-tuned in the quantized space.
FIGURE 2.4
Overview of Q-ViT, applying Information Rectification Module (IRM) for maximizing rep-
resentation information and Distribution Guided Distillation (DGD) for accurate optimiza-
tion.
inevitably deteriorates the attention module’s representation capability in capturing the in-
put’s global dependency. Second, the distillation for the fully quantized ViT baseline utilizes
a distillation token (following [224]) to directly supervise the quantized ViT classification output. However, we found that such simple supervision is not effective enough, since it is coarse-grained given the large gap between the quantized attention scores and their full-precision counterparts.
To address the issues above, a fully quantized ViT (Q-ViT) [136] is developed by retain-
ing the distribution of quantized attention modules as that of full-precision counterparts (see
the overview in Fig. 2.4). Accordingly, we propose to modify the distorted distribution over
quantized attention modules through an Information Rectification Module (IRM) based
on information entropy maximization in the forward process. In the backward process, we
present a distribution-guided distillation (DGD) scheme to eliminate the distribution vari-
ation through attention similarity loss between the quantized ViT and the full-precision
counterpart.
Here, clip{y, r₁, r₂} returns y with values below r₁ set to r₁ and values above r₂ set to r₂, and ⌊y⌉ rounds y to the nearest integer. With quantization of activations to signed a bits and weights to signed b bits, $Q_n^x = 2^{a-1}$, $Q_p^x = 2^{a-1}-1$ and $Q_n^w = 2^{b-1}$, $Q_p^w = 2^{b-1}-1$. In
general, the forward and backward propagation of the quantization function in the quantized
network is formulated as
$$\text{Forward: } \ \text{Q-Linear}(x) = \hat{x} \cdot \hat{w} = \alpha_x \alpha_w \big( (Q_a(x) + z/\alpha_x) \otimes Q_w(w) \big),$$
$$\text{Backward: } \ \frac{\partial J}{\partial x} = \frac{\partial J}{\partial \hat{x}} \frac{\partial \hat{x}}{\partial x} = \begin{cases} \dfrac{\partial J}{\partial \hat{x}}, & \text{if } x \in [-Q_n^x, Q_p^x], \\ 0, & \text{otherwise}, \end{cases}$$
$$\frac{\partial J}{\partial w} = \frac{\partial J}{\partial x} \frac{\partial x}{\partial \hat{w}} \frac{\partial \hat{w}}{\partial w} = \begin{cases} \dfrac{\partial J}{\partial x} \dfrac{\partial x}{\partial \hat{w}}, & \text{if } w \in [-Q_n^w, Q_p^w], \\ 0, & \text{otherwise}, \end{cases} \qquad (2.14)$$
where J is the loss function and Q(·) is applied in forward propagation, while the straight-through estimator (STE) [9] is used to retain the gradient derivation in backward propagation. ⊗ denotes matrix multiplication with efficient bit-wise operations.
The input images are first encoded as patches and pass through several transformer
blocks. This transformer block consists of two components: Multi-Head Self-Attention
(MHSA) and Multi-Layer Perceptron (MLP). The computation of attention weight de-
pends on the corresponding query q, key k and value v, and the quantized computation in
one attention head is
where Q-Linearq , Q-Lineark , Q-Linearv denote the three quantized linear layers for q, k, v,
respectively. Thus, the attention weight is formulated as
$$A = \frac{1}{\sqrt{d}}\big(Q_a(q) \otimes Q_a(k)^{\top}\big), \qquad Q_A = Q_a(\mathrm{softmax}(A)). \qquad (2.16)$$
FIGURE 2.5
Analysis of bottlenecks from an architecture perspective. We report the accuracy of 2-
bit quantized DeiT-S and DeiT-B on the ImageNet data set to replace the full precision
structure.
only a drop of 1.78% and 4.26%, respectively. And once the query, key, value, and attention
weights are quantized, even with all weights of linear layers in the MHSA module in full
precision, the performance drops (10.57%) are still significant. Thus, improving the attention
structure is critical to solving the performance drop problem of quantized ViT.
Optimization bottleneck. We calculate l2-norm distances between each attention weight
among different blocks of the DeiT-S architecture as shown in Fig. 2.6. The MHSA modules
in full-precision ViT with different depths learn different representations from images. As mentioned in [197], lower ViT layers pay attention both locally and globally. However, the fully quantized ViT (blue lines in Fig. 2.6) fails to learn
accurate distances from the attention map. Therefore, it requires a new design to use full-
precision teacher information better.
Since weight and activation with a highly compressed bit width in fully quantized ViT
have limited capabilities, the ideal quantization process should preserve the corresponding
FIGURE 2.6
Attention-distance comparison for full-precision DeiT-Small, fully quantized DeiT-Small
baseline, and Q-ViT for the same input. Q-ViT shows similar behavior with the full-precision
model, while the baseline suffers indistinguishable attention distance for information degra-
dation.
full-precision counterparts as much as possible; thus, the mutual information between quantized and full-precision representations should be preserved [195]. As shown in [171], for the Gaussian distribu-
tion, the quantizers with the maximum output entropy (MOE) and the minimum aver-
age error (MAE) are approximately the same within a multiplicative constant. Therefore,
minimizing the error between the full precision and the quantized values is equivalent to
maximizing the information entropy of the quantized values. Thus, when the deterministic
quantization function is applied to quantized ViT, this objective is equivalent to maximiz-
ing the information entropy H(Qx ) of the quantized representation Qx [171] in Eq.(2.16),
FIGURE 2.7
The histogram of query and key values q, k (shadow) along with the PDF curve of Gaussian
distribution N (μ, σ 2 ) [195], for three selected layers in DeiT-T and 4-bit Q-ViT. μ and σ 2
are the statistical mean and variance of the values.
which is defined as
$$H(Q_a(x)) = -\sum_{q_x} p(q_x) \log p(q_x) = \frac{1}{2}\log 2\pi e \sigma_x^2,$$
$$\max H(Q_a(x)) = n \ln 2, \quad \text{when } p(q_x) = \frac{1}{2^n}, \qquad (2.19)$$
where qx are the random quantized variables in Qa (x) (which is Qa (q) or Qa (k) under
different conditions) with probability mass function p(·). The information entropy in the
quantization process should be maximized to retain the information contained in the MHSA
modules from their full-precision counterparts.
However, direct application of a quantization function that converts values into finite
fixed points brings about irreversible disturbance to the distributions and the information
entropy H(Qa (q)) and H(Qa (k)) degenerates to a much lower level than its full precision
counterparts. To mitigate the information degradation from the quantization process in the
attention mechanism, an Information Rectification Module (IRM) is proposed to effectively
maximize the information entropy of quantized attention weights.
$$Q_a(\tilde{q}) = Q_a\!\left(\frac{q - \mu(q) + \beta_q}{\gamma_q \sigma^2(q) + \epsilon_q}\right), \qquad Q_a(\tilde{k}) = Q_a\!\left(\frac{k - \mu(k) + \beta_k}{\gamma_k \sigma^2(k) + \epsilon_k}\right), \qquad (2.20)$$
where $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ are the learnable parameters to modify the distributions of $\tilde{q}$ and $\tilde{k}$, while $\epsilon_q$ and $\epsilon_k$ are constants that prevent the denominator from being 0. The learning rates of the learnable $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ are the same as for the entire network. Thus, after IRM, the information entropy $H(Q_a(\tilde{q}))$ and $H(Q_a(\tilde{k}))$ is formulated as
$$H(Q(\tilde{q})) = \frac{1}{2}\log 2\pi e\big[\gamma_q^2(\sigma_q^2 + \epsilon_q)\big], \qquad H(Q(\tilde{k})) = \frac{1}{2}\log 2\pi e\big[\gamma_k^2(\sigma_k^2 + \epsilon_k)\big]. \qquad (2.21)$$
Then, to revive the ability of the attention mechanism to capture critical elements by maximizing information entropy, the learnable parameters $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ reshape the distributions of the query and key values to reach the maximum-information state. In a nutshell, in our IRM-Attention structure, the information entropy of the quantized attention weights is maximized to alleviate severe information distortion and revive the attention mechanism.
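A sketch of the rectification in Eq. (2.20), applied to the query before quantization (the key is handled identically). The dimension over which the statistics are taken, the scalar shape of γ and β, and the quantizer passed in are all assumptions of this example; only the rectification formula follows the equation as printed above.

```python
import torch
import torch.nn as nn

class InformationRectification(nn.Module):
    """Reshape the query/key distribution with learnable gamma, beta before quantization."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))
        self.eps = eps

    def forward(self, q, quantizer):
        mu = q.mean(dim=-1, keepdim=True)
        var = q.var(dim=-1, keepdim=True, unbiased=False)
        q_tilde = (q - mu + self.beta) / (self.gamma * var + self.eps)   # Eq. (2.20)
        return quantizer(q_tilde)    # e.g., an a-bit activation quantizer Q_a
```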
where $\|\cdot\|_2$ denotes $\ell_2$ normalization and l, h are the layer index and the head index.
Previous work shows that matrices constructed in this way are regarded as specific patterns
that reflect the semantic understanding of the network [226]. And the patches encoded
from the input images contain a high-level understanding of parts, objects, and scenes [83].
Thus, such a semantic-level distillation target guides and meticulously supervises quantized
ViT. The corresponding G̃lqh ;T and G̃lkh ;T are constructed in the same way by the teacher’s
activation. Thus, combining the original distillation loss in Eq. (2.17), the final distillation
loss is formulated as
$$L_{\mathrm{DGD}} = \sum_{l\in[1,L]} \sum_{h\in[1,H]} \Big( \big\|\tilde{G}^{l;T}_{q_h} - \tilde{G}^{l}_{q_h}\big\|_2 + \big\|\tilde{G}^{l;T}_{k_h} - \tilde{G}^{l}_{k_h}\big\|_2 \Big), \quad L_{\mathrm{distillation}} = L_{\mathrm{dist}} + L_{\mathrm{DGD}}, \qquad (2.23)$$
where L and H denote the number of ViT layers and heads. With the proposed Distribution-
Guided Distillation, Q-ViT retains the distribution over query and key from the full-
precision counterparts (as shown in Fig. 2.7).
Our DGD scheme first provides the distribution-aware optimization direction by pro-
cessing appropriate distilled parameters. Then it constructs similarity matrices to eliminate
scale differences and numerical instability, thereby improving fully quantized ViT by accu-
rate optimization.
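A sketch of the DGD term for one layer and one head, under the assumption that the similarity matrix $\tilde{G}$ is the Gram matrix of ℓ2-normalized query (or key) tokens; the exact construction is given by the similarity-matrix definition referenced above.

```python
import torch
import torch.nn.functional as F

def dgd_term(q_student: torch.Tensor, q_teacher: torch.Tensor) -> torch.Tensor:
    """q_* have shape (batch, tokens, dim); returns the Frobenius distance between
    the student and teacher patch-to-patch similarity patterns."""
    def similarity(q):
        q = F.normalize(q, dim=-1)          # l2-normalize each token representation
        return q @ q.transpose(-2, -1)      # tokens x tokens similarity matrix

    diff = similarity(q_teacher) - similarity(q_student)
    return diff.pow(2).sum(dim=(-2, -1)).sqrt().mean()

# The full L_DGD sums this term over layers and heads for both queries and keys.
```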
FIGURE 2.8
The histogram of query values q (blue shadow) and corresponding PDF curves (red curve)
of Gaussian distribution [136], w.r.t the cross attention of different decoder layers in (a) real-
valued DETR-R50, and (b) 4-bit quantized DETR-R50 (baseline). Gaussian distribution is
generated from the statistical mean and variance of the query values. The query in quantized
DETR-R50 bears information distortion compared with the real-valued one. Experiments
are performed on the VOC dataset [62].
FIGURE 2.10
Overview of the proposed Q-DETR framework. We introduce the distribution rectification
distillation method (DRD) to refine the performance of Q-DETR. From left to right, we
respectively show the detailed decoder architecture of Q-DETR and the learning framework
of Q-DETR. The Q-Backbone, Q-Encoder, and Q-Decoder denote quantized architectures,
respectively.
inaccurate object localization. Therefore, a more generic method for DETR quantization is
necessary.
To tackle the issue above, we propose an efficient low-bit quantized DETR (Q-
DETR) [257] by rectifying the query information of the quantized DETR as that of the
real-valued counterpart. Figure 2.10 provides an overview of our Q-DETR, mainly accom-
plished by a distribution rectification knowledge distillation method (DRD). We find ineffec-
tive knowledge transferring from the real-valued teacher to the quantized student primarily
because of the information gap and distortion. Therefore, we formulate our DRD as a bi-level
optimization framework established on the information bottleneck principle (IB). Generally,
it includes an inner-level optimization to maximize the self-information entropy of student
queries and an upper-level optimization to minimize the conditional information entropy
between student and teacher queries. At the inner level, we conduct a distribution alignment
for the query guided by its Gaussian-alike distribution, as shown in Fig. 2.8, leading to an
explicit state in compliance with its maximum information entropy in the forward propaga-
tion. At the upper level, we introduce a new foreground-aware query matching that filters
out low-qualified student queries for exact one-to-one query matching between student and
teacher, providing valuable knowledge gradients to push minimum conditional information
entropy in the backward propagation.
where clip{y, r₁, r₂} clips the input y with value bounds r₁ and r₂; ⌊y⌉ rounds y to its nearest integer; and ◦ denotes channel-wise multiplication. Here, $Q_n^x = -2^{a-1}$, $Q_p^x = 2^{a-1}-1$ and $Q_n^w = -2^{b-1}$, $Q_p^w = 2^{b-1}-1$ are the discrete bounds for a-bit activations and b-bit weights.
(1) Quantizing backbone (2) Quantizing encoder (3) Quantizing MHA of decoder (4) Quantizing MLPs
FIGURE 2.11
Performance of 3/4-bit quantized DETR-R50 on VOC with different quantized modules.
x generally denotes the activation in this paper, including the input feature
map of convolution and fully-connected layers and input of multi-head attention modules.
Based on this, we first give the quantized fully-connected layer as:
$$\text{Q-FC}(x) = Q_a(x) \cdot Q_w(w) = \alpha_x \alpha_w \circ (x_q \odot w_q + z/\alpha_x \circ w_q), \quad (2.25)$$
where $\cdot$ denotes matrix multiplication and $\odot$ denotes matrix multiplication with efficient bit-wise operations.
efficient bit-wise operations. The straight-through estimator (STE) [9] is used to retain the
derivation of the gradient in backward propagation.
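The uniform quantizer and the straight-through estimator described above can be sketched as follows (a minimal simplification under our own assumptions about a per-tensor scale; it is not the exact Q-DETR implementation):

```python
import torch

def fake_quantize(x, alpha, bits=4, signed=True):
    """Uniform fake quantization with a straight-through estimator (STE).

    x:     activations or weights
    alpha: positive quantization scale (per-tensor here for simplicity)
    """
    q_n = -(2 ** (bits - 1)) if signed else 0
    q_p = (2 ** (bits - 1) - 1) if signed else (2 ** bits - 1)
    x_clipped = torch.clamp(x / alpha, q_n, q_p)
    x_rounded = torch.round(x_clipped)
    # STE: use the rounded value in the forward pass while letting the
    # gradient flow as if rounding were the identity map.
    x_q = x_clipped + (x_rounded - x_clipped).detach()
    return x_q * alpha
```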
In DETR [31], the visual features generated by the backbone are augmented with posi-
tion embedding and fed into the transformer encoder. Given an encoder output E, DETR
performs co-attention between object queries O and the visual features E, which are for-
mulated as:
$$q = \text{Q-FC}(O), \qquad k, v = \text{Q-FC}(E),$$
$$A_i = \mathrm{softmax}\big(Q_a(q)_i \cdot Q_a(k)_i^{\top}/\sqrt{d}\big), \quad (2.26)$$
$$D_i = Q_a(A)_i \cdot Q_a(v)_i,$$
where D is the multi-head co-attention module, i.e., the co-attended feature for the object
query. The d denotes the feature dimension in each head. More FC layers transform the
decoder’s output features of each object query for the final output. Given box and class
predictions, the Hungarian algorithm [31] is applied between predictions and ground-truth
box annotations to identify the learning targets of each object query.
decoder module to low bits, i.e., (1)+(2)+(3), brings the most significant accuracy drop among all parts of the DETR methods, up to 2.1% in the 3-bit DETR-R50.
At the same time, other parts of DETR show comparative robustness to the quantization
function. Consequently, the critical problem of improving the quantized DETR methods is
restoring the information in MHA modules after quantization. Other qualitative results in
Fig. 2.8 and Fig. 2.9 also indicate that the degraded information representation is the main
obstacle to a better quantized DETR.
where qT and qS represent the queries in the teacher and student DETR methods as
predefined in Eq. (2.26); β and γ are the Lagrange multipliers [210]; θ_S denotes the parameters of the student; and I(·) returns the mutual information of two input variables. The
first item I(X; ES ) minimizes information between input and visual features ES to extract
task-oriented hints [240]. The second item I(ES , qS ; y GT ) maximizes information between
extracted visual features and ground-truth labels for better object detection. Common net-
work training and detection loss constraints can easily accomplish these two items, such as
proposal classification and coordinate regression.
The core issue of this paper is to solve the third item I(qS ; qT ), which attempts to
address the information distortion in student query via introducing teacher query as a
priori knowledge. To accomplish our goal, we first expand the third item and reformulate
it as:
I(qS ; qT ) = H(qS ) − H(qS |qT ), (2.28)
where H(qS ) returns the self information entropy expected to be maximized while
H(qS |qT ) is the conditional entropy expected to be minimized. It is challenging to optimize
the above maximum and minimum items simultaneously. Instead, we make a compromise to
reformulate Eq. (2.28) as a bi-level issue [152, 46] that alternately optimizes the two items,
which is explicitly defined as:
$$\min_{\theta_S} H(q_S^*|q_T), \quad \text{s.t.} \quad q_S^* = \arg\max_{q_S} H(q_S). \quad (2.29)$$
However, an explicit form of H(qS ) can only be parameterized with a regular distribution
p(qSi ).
Luckily, the statistical results in Fig. 2.8 show that the query distribution tends to
follow a Gaussian distribution, also observed in [136]. This enables us to solve the inner-
level optimization in a distribution alignment fashion. To this end, we first calculate the
mean μ(qS ) and variance σ(qS ) of query qS whose distribution is then modeled as qS ∼
N (μ(qS ), σ(qS )). Then, the self-information entropy of the student query can proceed as:
a zero denominator. The mean and variance might be inaccurate in practice due to query
data bias. To solve this, we use the concepts in batch normalization (BN) [207, 102] where
a learnable shifting parameter βqS is added to move the mean value. A learnable scaling
parameter γqS is multiplied to move the query to the adaptive position. In this situation,
we rectify the information entropy of the query in the student as follows:
$$q_S^* = \gamma_{q_S}\,\frac{q_S-\mu(q_S)}{\sqrt{\sigma(q_S)^2+\epsilon_{q_S}}}+\beta_{q_S}, \quad (2.32)$$
in which case the maximum self-information entropy of the student query becomes $H(q_S^*) = \frac{1}{2}\log 2\pi e\big[(\sigma^2_{q_S}+\epsilon_{q_S})/\gamma^2_{q_S}\big]$. Therefore, in the forward propagation, we can obtain the current optimal query $q_S^*$ via Eq. (2.32), after which the upper-level optimization is further executed as detailed in the following.
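A minimal sketch of this distribution alignment, assuming per-query statistics and a small constant ε to avoid a zero denominator (the module name and tensor shapes are our own):

```python
import torch
import torch.nn as nn

class QueryDistributionAlignment(nn.Module):
    """Sketch of the inner-level distribution alignment of Eq. (2.32)."""

    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scaling
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shifting
        self.eps = eps                               # avoids a zero denominator

    def forward(self, q_s):
        # q_s: student queries, shape (..., dim)
        mu = q_s.mean(dim=-1, keepdim=True)
        var = q_s.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (q_s - mu) / torch.sqrt(var + self.eps) + self.beta
```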
Upper-level optimization. We continue minimizing the conditional information entropy between the student and the teacher. Following DETR [31], we denote the ground-truth labels by $y^{GT} = \{c_i^{GT}, b_i^{GT}\}_{i=1}^{N_{gt}}$ as a set of ground-truth objects, where $N_{gt}$ is the number of foregrounds, and $c_i^{GT}$ and $b_i^{GT}$ respectively represent the class and coordinates (bounding box) of the i-th object. In DETR, each query is associated with an object. Therefore, we can obtain N objects for the teacher and student as well, denoted as $y^S = \{c_j^S, b_j^S\}_{j=1}^{N}$ and $y^T = \{c_j^T, b_j^T\}_{j=1}^{N}$.
The minimization of the conditional information entropy requires the student and
teacher objects to be in a one-to-one matching. However, it is problematic for DETR due
primarily to the sparsity of prediction results and the instability of the query’s predic-
tions [129]. To solve this, we propose a foreground-aware query matching to rectify “well-
matched” queries. Concretely, we match the ground-truth bounding boxes with the student proposals
to find the maximum coincidence as:
S
Gi = max GIoU(bGT
i , bj ), (2.33)
1≤j≤N
where GIoU(·) is the generalized intersection over union function [202]. Each Gi reflects the
“closeness” of student proposals to the i-th ground-truth object. Then, we retain highly qual-
ified student proposals around at least one ground truth to benefit object recognition [235]
as:
$$b_j^S = \begin{cases} b_j^S, & \mathrm{GIoU}(b_i^{GT}, b_j^S) > \tau G_i, \ \exists\, i \\ \emptyset, & \text{otherwise}, \end{cases} \quad (2.34)$$
where τ is a threshold controlling the proportion of distilled queries. After removing object-empty (∅) queries, we form a distillation-desired query set of the student, denoted as $\tilde{q}^S$, associated with its object set $\tilde{y}^S = \{\tilde{c}_j^S, \tilde{b}_j^S\}_{j=1}^{\tilde{N}}$. Correspondingly, we can obtain a teacher query set $\tilde{y}^T = \{\tilde{c}_j^T, \tilde{b}_j^T\}_{j=1}^{\tilde{N}}$. For the j-th student query, its corresponding teacher query is matched as:
$$\tilde{c}_j^T, \tilde{b}_j^T = \arg\max_{\{c_k^T, b_k^T\}_{k=1}^{N}}\ \mu_1\,\mathrm{GIoU}\big(\tilde{b}_j^S, b_k^T\big) - \mu_2\,\big\|\tilde{b}_j^S - b_k^T\big\|_1, \quad (2.35)$$
where μ1 = 2 and μ2 = 5 control the matching function, whose values follow [31].
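The foreground-aware query matching of Eqs. (2.33)–(2.35) can be sketched as below, assuming boxes in (x1, y1, x2, y2) format and using torchvision's generalized_box_iou; the function name and return values are our own simplifications:

```python
import torch
from torchvision.ops import generalized_box_iou

def foreground_aware_matching(gt_boxes, student_boxes, teacher_boxes,
                              tau=0.6, mu1=2.0, mu2=5.0):
    """Sketch of Eqs. (2.33)-(2.35) with (x1, y1, x2, y2) boxes."""
    giou_gt_s = generalized_box_iou(gt_boxes, student_boxes)   # (Ngt, N)
    g = giou_gt_s.max(dim=1).values                            # Eq. (2.33)

    # Eq. (2.34): keep student boxes close to at least one ground truth.
    keep = (giou_gt_s > tau * g[:, None]).any(dim=0)
    kept_idx = keep.nonzero(as_tuple=True)[0]

    # Eq. (2.35): match every retained student box to one teacher box.
    giou_s_t = generalized_box_iou(student_boxes[kept_idx], teacher_boxes)
    l1 = torch.cdist(student_boxes[kept_idx], teacher_boxes, p=1)
    matched_teacher = (mu1 * giou_s_t - mu2 * l1).argmax(dim=1)
    return kept_idx, matched_teacher
```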
Finally, the upper-level optimization after rectification in Eq. (2.29) becomes:
$$\min_{\theta_S} H\big(\tilde{q}_S^*|\tilde{q}_T\big). \quad (2.36)$$
Optimizing Eq. (2.36) is challenging. Alternatively, we minimize the norm distance between $\tilde{q}_S^*$ and $\tilde{q}_T$, whose optimum, i.e., $\tilde{q}_S^* = \tilde{q}_T$, is exactly the same as that of Eq. (2.36). Thus, our distribution rectification distillation loss becomes:
$$L_{DRD}\big(\tilde{q}_S^*, \tilde{q}_T\big) = \mathbb{E}\big[\|\tilde{D}_S^* - \tilde{D}_T\|_2\big], \quad (2.37)$$
where we use the Euclidean distance of the co-attended features $\tilde{D}$ (see Eq. 2.26), which contain the query information $\tilde{q}$, for optimization.
In backward propagation, the gradient updating drives the student queries toward their
teacher hints. Therefore, we accomplish our distillation. The overall training losses for our
Q-DETR model are:
$$L = L_{GT}\big(y^{GT}, y^S\big) + \lambda L_{DRD}\big(\tilde{q}_S^*, \tilde{q}_T\big), \quad (2.38)$$
where LGT is the common detection loss for missions such as proposal classification and
coordinate regression [31], and λ is a trade-off hyper-parameter.
FIGURE 2.12
(a) We select τ and λ using 4-bit Q-DETR-R50 on VOC. (b) The mutual information curves
of I(X; E) and I(y GT ; E, q) (Eq. 2.27) on the information plane. The red curves represent
the teacher model (DETR-R101). The orange, green, red, and purple lines represent the
4-bit baseline, 4-bit baseline + DA, 4-bit baseline + FQM, and 4-bit baseline + DA +
FQM (4-bit Q-DETR).
memory. We use ImageNet ILSVRC12 [123] to pre-train the backbone of a quantized stu-
dent. The training protocol is the same as the employed frameworks [31, 70]. Specifically,
we use a batch size of 16. AdamW [164] is used to optimize the Q-DETR, with the ini-
tial learning rate of 1e−4 . We train for 300/500 epochs for the Q-DETR on VOC/COCO
dataset, and the learning rate is multiplied by 0.1 at the 200/400-th epoch, respectively.
Following the SMCA-DETR, we train the Q-SMCA-DETR for 50 epochs, and the learning
rate is multiplied by 0.1 at the 40th epoch on both the VOC and COCO datasets. We utilize
a multi-distillation strategy, saving the encoder and decoder network as real-valued at the
first stage. Then we train the fully quantized DETR at the second stage, where we load
the weight from the checkpoint of the first stage. We select real-valued DETR-R101 (84.5%
AP50 on VOC and 43.5% AP on COCO) and SMCA-DETR-R101 (85.3% AP50 on VOC
and 44.4% AP on COCO) as teacher network.
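For reference, the optimizer and learning-rate schedule described above (VOC setting) could be configured roughly as follows; the helper name and arguments are our own, not part of the original protocol:

```python
import torch

def build_optimizer_and_scheduler(model, epochs=300, milestone=200):
    """AdamW with an initial learning rate of 1e-4, decayed by 0.1 at the
    milestone epoch (VOC setting described above)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[milestone], gamma=0.1)
    return optimizer, scheduler
```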
Hyper-parameter selection. As mentioned, we select hyper-parameters τ and λ in
this part using the 4-bit Q-DETR model. We show the model performance (AP50 ) with
different setups of the hyper-parameters {τ, λ} in Fig. 2.12 (a), where we conduct ablative experiments on the baseline + DA (AP50 = 78.8%). As can be seen, the performance first increases and then decreases as τ grows from left to right. Since τ controls the proportion of selected distillation-desired queries, we find that full imitation (τ = 0) performs worse than the vanilla baseline with no distillation (τ = 1), showing that query selection is necessary. Q-DETR performs better with τ set to 0.5 or 0.6.
With the varying value of λ, we find that {λ, τ} = {2.5, 0.6} boosts the performance of Q-DETR the most, achieving 82.7% AP50 on VOC test2007. Based on the ablative study above, we set
hyper-parameters τ and λ as 0.6 and 2.5, respectively, for the experiments in this paper.
Effectiveness of components. We show quantitative component improvements in Q-
DETR in Table 2.2. As shown in Table 2.2, the quantized DETR baseline suffers a severe
performance drop on AP50 (13.6%, 6.5%, and 5.3% with 2/3/4-bit, respectively). DA and
FQM improve the performance when used alone, and the two techniques further boost the
performance considerably when combined. For example, the DA improves the 2-bit baseline
TABLE 2.2
Evaluating the components of Q-DETR-R50 on the VOC dataset.
Method #Bits AP50 #Bits AP50 #Bits AP50
Real-valued 32-32-32 83.3 - - - -
Baseline 4-4-8 78.0 3-3-8 76.8 2-2-8 69.7
+DA 4-4-8 78.8 3-3-8 78.0 2-2-8 71.6
+FQM 4-4-8 81.5 3-3-8 80.9 2-2-8 74.9
+DA+FQM (Q-DETR) 4-4-8 82.7 3-3-8 82.1 2-2-8 76.4
Note: #Bits (W-A-Attention) denotes the bit-width of weights, activations, and attention
activations. DA denotes the distribution alignment module. FQM denotes foreground-aware
query matching.
by 1.9%, and FQM achieves a 5.2% performance improvement. When DA and FQM are combined, the improvement reaches 6.7%.
Information analysis. We further show the information plane following [238] in
Fig. 2.12. We adopt the test AP50 to quantify I(y GT ; E, q). We employ a reconstruction
decoder to decode the encoded feature E to reconstruct the input and quantify I(X; E)
using the ℓ1 loss. As shown in Fig. 2.12, the curve of the larger teacher DETR-R101 is
usually on the right of the curve of small student models, which indicates a greater ability
of information representation. Likewise, the purple line (Q-DETR-R50) is usually on the
right of the three left curves, showing the information representation improvements with
the proposed methods.
3
Algorithms for Binary Neural
Networks
3.1 Overview
The most extreme quantization in the quantization area is binarization, which is the focus
of this book. Data can only have one of two potential values during binarization, which
is a 1-bit quantization: −1 (or 0) or +1. Both weight and activation can be represented
by a single bit in network compression without consuming a lot of memory. In addition,
binarization replaces costly matrix multiplication operations with lighter bitwise XNOR
and Bitcount operations. Therefore, compared to alternative compression techniques, binary
neural networks (BNNs) have a variety of hardware-friendly advantages, such as significant
acceleration, memory savings, and power efficiency. The usefulness of binarization has been
demonstrated by ground-breaking work like BNN [99] and XNOR-Net [199], with XNOR-
Net achieving up to a 58× speedup on CPUs and up to 32× memory savings for a 1-bit
convolution layer. Following the BNN paradigm, a lot of research has been done on this
topic in recent years from the field of computer vision and machine learning [84, 201, 153],
and it has been used for a variety of everyday tasks including image classification [48,
199, 159, 196, 267, 259], detection [263, 240, 264, 260], point cloud processing [194, 261],
object reidentification [262], etc. By transforming a layer from full precision to 1-bit, the
binarization approach intuitively makes it simple to verify the significance of a layer. If
performance suffers noticeably after binarizing a particular layer, we can infer that this layer
is on the network’s sensitive path. From the perspective of explainable machine learning, it
is also essential to determine if full-precision and binarized models operate similarly.
Numerous researchers have sought to shed light on the behaviors of model binarization,
as well as the relationships between the robustness of the model and the architecture of
deep neural networks, in addition to concentrating on the methods of model binarization.
This may aid in approaching solutions to fundamental queries of what network topology
is preferable and how the deep network functions. It is crucial to thoroughly explore BNN
studies because they will help us better understand the behaviors and architectures of
effective and reliable deep learning models. Some outstanding prior art reveals how BNN’s
components work. For example, Bi-Real Net [159] incorporates more shortcuts (Bi-Real)
to mitigate the information loss caused by binarization. This structure functions similarly
to the ResNet shortcut [84], which helps to explain why commonly used shortcuts can
somewhat improve the performance of deep neural networks. One thing that can be observed
by looking at the activations is that more specific information from the shallow layer can be
transmitted to the deeper layer during forward propagation. On the other hand, to avoid
the gradient vanishing problem, gradients can be directly propagated backward using the
shortcut. By building numerous weak classifier groups, some ensemble approaches [301]
improve BNN performance but occasionally run into overfitting issues. Based on analysis
and testing with BNNs, they demonstrated that the number of neurons trumps bit width
and that real-valued neurons may not even be required in deep neural networks, which is
comparable to the idea behind biological neural networks.
Additionally, an efficient method to examine the interpretability of deep neural networks
is to reduce the bit width of a particular layer and examine its impact on accuracy. Numerous
works [199, 159] investigate how sensitive various layers are to binarization. In common
BNNs, the first and last layers should, by default, be kept at higher precision. This means
that these layers are more crucial for predicting neural networks. This section attempts to
state the nature of binary neural networks by introducing some representative work.
where ⊗ represents the convolution operation. In this book, we omit the non-linear function
for simplicity. Following the prior works [48, 99], BNN intends to represent wn and an in a
binary discrete set as:
B := {−1(0), +1}.
Thus, the 1-bit formats of $w^n$ and $a^n$ are respectively $b^{w^n} \in \mathbb{B}^{C_{out}^n \times C_{in}^n \times K^n \times K^n}$ and $b^{a_{in}^n} \in \mathbb{B}^{C_{in}^n \times W_{in}^n \times H_{in}^n}$, such that the efficient XNOR and Bit-count instructions can approximate the floating-point convolution,
where ◦ represents channel-wise multiplication and $\odot$ denotes the XNOR and Bit-count instructions.
However, this quantization mode will cause the output amplitude to increase dramati-
cally, different from the full precision convolution calculation, and cause the homogenization
of characteristics [199]. Several novel objects are proposed to address this issue, which will
be introduced in the following.
The XNOR-Net binarization approach seeks to identify the most accurate convolutional
approximations. Specifically, XNOR-Net employs a scaling factor, which plays a vital role
in the learning of BNNs, and improves the forward pass of BNNs as:
$$a_{out}^n \approx \alpha^n \circ \big(b^{w^n} \odot b^{a_{in}^n}\big), \quad (3.3)$$
where $\alpha^n = \{\alpha_1^n, \alpha_2^n, ..., \alpha_{C_{out}^n}^n\} \in \mathbb{R}_+^{C_{out}^n}$ is known as the channel-wise scaling factor vector
to mitigate the output gap between Eq. (3.1) and its approximation in Eq. (3.3). We denote $\mathcal{A} = \{\alpha^n\}_{n=1}^N$. Since the weight values are binary, XNOR-Net can implement the convolution with additions and subtractions. In the following, we state the XNOR operation for a specific convolution layer, thus omitting the superscript n for simplicity. Most existing implementations simply follow earlier studies [199, 159] to optimize $\mathcal{A}$ based on non-parametric optimization as:
$$\alpha^*, b^{w*} = \arg\min_{\alpha,\, b^w} J\big(\alpha, b^w\big), \quad (3.4)$$
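A minimal sketch of this XNOR-Net-style binarization, assuming the channel-wise scaling factor is taken as the mean absolute value per output channel (the closed-form solution of Eq. (3.4) reported in [199]):

```python
import torch

def xnor_binarize(weight):
    """Binarize (C_out, C_in, K, K) kernels and compute channel-wise alpha.

    alpha is the mean absolute value per output channel, i.e. the
    closed-form solution of Eq. (3.4) reported in [199].
    """
    b_w = torch.sign(weight)
    alpha = weight.abs().mean(dim=(1, 2, 3), keepdim=True)   # (C_out, 1, 1, 1)
    return b_w, alpha

# The binary convolution of Eq. (3.3) is then approximated as
#   a_out ≈ alpha * conv(sign(a_in), b_w),
# where conv can be realized with XNOR and Bit-count on hardware.
```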
FIGURE 3.1
The overall frameworks of Modulated Convolutional Networks (MCNs).
creased. In particular, to alleviate the disturbance caused by the binarized process, a center
loss is designed to incorporate the intraclass compactness with the quantization loss and
filter loss. The red arrows are used to show the back-propagation process. By considering
filter loss, center loss, and softmax loss in a unified framework, we achieve much better
performance than state-of-the-art binarized models. Most importantly, our MCN model is highly compressed and performs similarly to the well-known full-precision ResNets and Wide ResNets.
As shown in Fig. 3.1, M-Filters and weights can be jointly optimized end-to-end, resulting
in a compact and portable learning architecture. Due to the low model complexity, such an
architecture is less prone to overfitting and is suitable for resource-constrained environments.
Specifically, our MCNs reduce the required storage space of a full-precision model by a factor of 32 while achieving the best performance among existing binarized-filter-based CNNs, even approaching that of CNNs with full-precision filters.
parameters to be optimized is significantly reduced, thus generating a computationally
efficient CNNs model.
$$\hat{C}_i \circ M = \sum_{j=1}^{K} \hat{C}_i * M_j, \quad (3.12)$$
where $M_j$, $j = 1, ..., K$, is a 3D matrix built from K copies of the corresponding 2D plane of the M-Filter, and $*$ is the element-wise multiplication operator, also termed the Schur product operation. In Eq. 3.12, M is a learned weight matrix used to reconstruct the convolutional filters $C_i$ based on $\hat{C}_i$ and the operation $\circ$, and it leads to the filter loss in Eq. 3.18. An example of filter modulation is shown in Fig. 3.2(a). In addition, the operation $\circ$ results in a new matrix (named the reconstructed filter), i.e., $\hat{C}_i * M_j$, which is elaborated in the following. We define:
$$Q_{ij} = \hat{C}_i * M_j, \quad (3.13)$$
FIGURE 3.2
(a) The modulation process based on an M-Filter to obtain a reconstructed Filter Q. (b)
An example of MCNs convolution with K = 4 planes. The number of planes of the M-Filter
is the same as the number of channels of the feature map. This chapter defines a feature
map as a 3D matrix with four channels.
In MCNs, reconstructed filters Ql in the lth layer are used to calculate output feature
maps F l+1 as:
where M Cconv denotes the convolution operation implemented as a new module. A simple
example of the forward convolutional process is described in Fig. 3.2(b), where there is one
input feature map with one generated output feature map. In MCconv, the channels of one
output feature map are generated as follows:
$$F_{h,k}^{l+1} = \sum_{i,g} F_g^l \otimes Q_{ik}^l, \quad (3.16)$$
$$F_h^{l+1} = \big(F_{h,1}^{l+1}, ..., F_{h,K}^{l+1}\big), \quad (3.17)$$
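To make the MCconv computation of Eqs. (3.12)–(3.17) concrete, the sketch below (our own simplification for a single output feature map; tensor shapes and the padding choice are assumptions) builds the reconstructed filters and concatenates the K per-plane responses:

```python
import torch
import torch.nn.functional as F

def mcconv_forward(feat, c_hat, m_filters):
    """Compute one output feature map of MCconv (Eqs. 3.12-3.17).

    feat:      input feature maps, shape (B, K, H, W)   -- K planes
    c_hat:     binarized filters,  shape (I, K, k, k)
    m_filters: 2D M-Filter planes, shape (K, k, k)
    """
    outputs = []
    for j in range(m_filters.shape[0]):
        # Reconstructed filters Q_ij = C_hat_i * M_j (Schur product with
        # K stacked copies of the 2D plane M_j), Eqs. (3.12)-(3.13).
        q_j = c_hat * m_filters[j].unsqueeze(0).unsqueeze(0)   # (I, K, k, k)
        # Sum of convolutions over all reconstructed filters, Eq. (3.16).
        out_j = F.conv2d(feat, q_j, padding=q_j.shape[-1] // 2)
        outputs.append(out_j.sum(dim=1, keepdim=True))
    # Concatenate the K per-plane responses into one output map, Eq. (3.17).
    return torch.cat(outputs, dim=1)
```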
FIGURE 3.3
MCNs Convolution (MCconv) with multiple feature maps. There are 10 and 20 feature
maps in the input and the output, respectively. The reconstructed filters are divided into
20 groups, and each group contains 10 reconstructed filters, corresponding to the number
of feature maps and MC feature maps, respectively.
map, h = 1, i = 1, ..., 10, g = 1, ..., 10, and for the second output feature map, h = 2, i =
11, ..., 20, g = 1, ..., 10.
When the first convolutional layer is considered, the spatial input size of the network is 32 × 32. First, each image channel is copied K = 4 times, resulting in a new input of size 4 × 32 × 32 to the entire network.
It should be noted that the number of input and output channels in every feature map
is the same, so MCNs can be easily implemented by simply replicating the same MCconv
module at each layer.
where θ and λ are hyperparameters, $M = \{M^1, ..., M^N\}$ is the set of M-Filters, and $\hat{C}$ is the binarized filter set across all layers. The operation ◦ defined in Eq. 3.12 is used to approximate
unbinarized filters based on binarized filters and M-Filters, leading to filter loss as the first
term on the right of Eq. 3.18. The second term on the right is similar to the center loss
used to evaluate intraclass compactness, which deals with the feature variation caused by
the binarization process. fm (Ĉ, M ) denotes the feature map of the last convolutional layer
for the mth sample, and f (Ĉ, M ) denotes the class-specific mean feature map of previous
samples. We note that the center loss is successfully deployed to handle feature variations.
We only keep the binarized filters and the shared M-Filters (quite small) to reduce the
storage space to calculate the feature maps after training. We consider the conventional
loss and then define a new loss function LS,M = LS + LM , where LS is the conventional
loss function, e.g., softmax loss.
Again, we consider the quantization process in our loss LS,M , and obtain the final
minimization objective as:
$$L(C, \hat{C}, M) = L_{S,M} + \frac{\theta}{2}\big\|C - C^{[k]} - \eta\delta_C^{[k]}\big\|^2, \quad (3.19)$$
where θ is shared with Eq. 3.18 to reduce the number of parameters, and $\delta_C^{[k]}$ is the gradient of $L_{S,M}$ with respect to $C^{[k]}$. Unlike conventional methods (such as XNOR), where only the filter reconstruction is considered in the weight calculation, our discrete optimization method provides a comprehensive way to calculate binarized CNNs by considering the filter loss, softmax loss, and feature compactness in a unified framework.
where L, LS , and LM are loss functions, and η1 is the learning rate. Furthermore, we have
the following.
$$\frac{\partial L_S}{\partial \hat{C}_i} = \frac{\partial L_S}{\partial Q}\cdot\frac{\partial Q}{\partial \hat{C}_i} = \sum_j \frac{\partial L_S}{\partial Q_{ij}}\cdot M_j, \quad (3.22)$$
$$\frac{\partial L_M}{\partial \hat{C}_i} = \theta\sum_j \big(C_i - \hat{C}_i \circ M_j\big)\circ M_j, \quad (3.23)$$
Algorithm 1 MCN training. L is the loss function, Q is the reconstructed filter, λ1 and λ2
are decay factors, and N is the number of layers. Update() updates the parameters based
on our update scheme.
Input: a minibatch of inputs and their labels, unbinarized filters C, modulation filters M ,
learning rates η1 and η2 , corresponding to C and M , respectively.
Output: updated unbinarized filters C t+1 , updated modulation filters M t+1 , and updated
learning rates η1t+1 and η2t+1 .
1: {1. Computing gradients with respect to the parameters:}
2: {1.1. Forward propagation:}
3: for k =1 to N do
4: Ĉ ← Binarize(C)
5: Computing Q via Eq. 3.13 ∼ 3.14
6: Convolutional features calculation using Eq. 3.15 ∼ 3.17
7: end for
8: {1.2. Backward propagation:}
9: {Note that the gradients are not binary.}
10: Computing $\delta_Q = \frac{\partial L}{\partial Q}$
11: for k =N to 1 do
12: Computing δĈ using Eq. 3.20, Eq. 3.22 ∼ 3.23
13: Computing δM using Eq. 3.24, Eq. 3.26 ∼ 3.27
14: end for
15: {Accumulating the parameters gradients:}
16: for k = 1 to N do
17: C t+1 ← Update(δĈ , η1 ) (using Eq. 3.21)
18: M t+1 ← Update(δM , η2 ) (using Eq. 3.25)
19: η1t+1 ← λ1 η1
20: η2t+1 ← λ2 η2
21: end for
Details about the derivatives concerning center loss can be found in [245]. These deriva-
tions show that MCNs can be learned with the BP algorithm. The quantization process leads
to a new loss function via a simple projection function, which never affects the convergence
of MCNs. We describe our algorithm in Algorithm 1.
FIGURE 3.4
Accuracy with different numbers of clustering centers for 20-layer MCNs with width 16-16-
32-64.
with a batch size of 128. Using different values of θ, the performance of MCNs is shown in
Fig. 3.7. First, only the effect of θ is evaluated. Then the center loss is implemented based
on a fine-tuning process. The performance is observed to be stable under variations of θ and λ.
The number of clustering centers: We show the quantization with U = 2, 3, 4 denoting
the numbers of clustering centers. In this experiment, we investigate the effect of varying
the number of clustering centers in MCNs based on CIFAR-10.
The results are shown in Fig. 3.4, where accuracy increases with more clustering centers
and center loss can also be used to improve performance. However, to save storage space
and to compare with other binary networks, we use two clustering centers for MCNs in all
the following experiments.
Our binarized networks reduce the storage of convolutional layers by a factor of 32 compared with the corresponding full-precision networks, where 4 bytes (32 bits) represent a real
value. Since MCNs only contain one fully connected layer that is not binarized, the storage
of the whole network is significantly reduced.
The architecture parameter K: The number of planes for each M-Filter, i.e., K, is also
evaluated. As revealed by the results in Fig. 3.5, more planes in each M-filter involved in
reconstructing the unbinarized filters yield better performance. For example, when increas-
ing K from 4 to 8, the performance is improved by 1.02%. For simplicity, we choose K = 4
in the following experiments.
The width of MCNs: CIFAR-10 is used to evaluate the effect of the width of Wide-
ResNets with MCNs. The accuracy and number of parameters are compared with a recent
binary CNN, LBCNN. The basic width of the stage (the number of convolution kernels
per layer) is set to 16 − 16 − 32 − 64. To compare with LBCNN, we set up 20-layer MCNs
with basic block-c (in Fig. 3.9), whose depth is the same as in LBCNN. We also use other
network widths to evaluate the effect of width on MCNs.
The results are shown in Table 3.1. The second column refers to the width of each layer
of the MCNs, and a similar notation is also used in [281]. In the third column, we give the
parameter amounts of MCNs and the 20-layer LBCNN with the best result. The fourth
column shows the accuracy of baselines whose networks are trained based on the Wide-
ResNets (WRNs) structure with the same depth and width as the MCNs. The last two
FIGURE 3.5
Accuracy with different K for 20-layer MCNs with width 16-16-32-64 on CIFAR-10.
columns show the accuracies of U-MCNs and MCNs, respectively. The performance in the
last three columns shows that the accuracy of MCNs only decreases slightly when binarized
filters are used. Note that with a fixed number of convolutional layers, the performance of
MCNs increases with larger network width. At the same time, the number of parameters also increases. Compared to LBCNN, MCNs have far fewer parameters (17.2 M vs. 61 M) yet much better performance (95.30% vs. 92.96%). Also, the last three columns show that MCNs have achieved performance similar to U-MCNs and
WRNs.
FIGURE 3.6
Network architectures of CNNs and MCNs.
FIGURE 3.7
Accuracy with different θ and λ.
TABLE 3.1
Classification accuracy (%) on CIFAR-10 with 20-layer U-MCNs and MCNs.
Method           Kernel Stage     Size (MB)   WRNs    U-MCNs   MCNs    MCNs-1
MCNs             16-16-32-64      1.1         92.31   93.69    92.08   92.10
                 16-32-64-128     4.3         –       94.88    93.98   93.94
                 32-64-128-256    17.1        –       95.50    95.13   95.33
                 64-64-128-256    17.2        95.75   95.72    95.30   95.34
LBCNN (q=384)    –                61          –       –        92.96   –
respectively, with similar accuracy (93.98% vs. 92.96%). When LBCNN has a number of parameters (4.3 M) similar to that of the MCNs, the test run time of LBCNN becomes 16.2 s, which is still slower than our MCNs.
Visualization: We visualize MCconv features across different layers in Fig. 3.8 and the curves of elements in different M-Filters in Fig. 3.11. Similarly to conventional CNNs, the features of different layers in Fig. 3.8 capture rich and hierarchical information. Based on the reconstructed filters Q corresponding to the M-Filters, we obtain convolutional features that appear diverse for different M-Filters. In summary, different MCconv layers and
FIGURE 3.8
Example of output feature maps produced by Q from different layers.
FIGURE 3.9
Residual blocks. (a) and (b) are for Wide-ResNets. (c) A basic block for MCNs.
M-Filters can capture hierarchical and diverse information, which results in high performance based on compressed models. Figure 3.11 shows the curves of the elements in M-Filter 1 (M1), M-Filter 2 (M2), M-Filter 3 (M3), and M-Filter 4 (M4) (in Fig. 3.2(a)
and Eq. 3.12) on the CIFAR experiment. The values of nine elements in each M-Filter are
learned similarly to their averages (dotted lines). This validates that the special MCNs-1
with a single average element in each Mj matrix is reasonable and compact without perfor-
mance loss.
FIGURE 3.10
Training and testing curves.
FIGURE 3.11
The curves of elements in M-Filter 1 (M1 ), M-Filter 2 (M2 ), M-Filter 3 (M3 ), and M-Filter
4 (M4 ) (in Fig. 3.2(a) and Eq. 3.12) on the CIFAR experiment in the training process. The
values of the nine elements in each M-Filter are learned similarly to their averages (dotted
lines). This validates that the special MCNs-1 with a single average element in each Mj
matrix is reasonable and compact without large performance loss.
reconstructing full-precision convolutional filters from binarized filters, limiting their use in
computationally limited environments. It has been theoretically and quantitatively demon-
strated that simplifying the convolution procedure via binarized kernels and approximating
the original unbinarized kernels is a very promising solution for DCNNs’ compression.
Although prior BNNs significantly reduce storage requirements, they also generally have
significant accuracy degradation compared to those using full-precision kernels and activa-
tions. This is mainly because CNN binarization has not been solved by considering discrete optimization in the backpropagation (BP) process. Discrete optimization methods can of-
ten guarantee the quality of the solutions they find and lead to much better performance in
practice [66, 119, 127]. Second, the loss caused by the binarization of CNNs has not been
well studied.
We propose a new discrete backpropagation via projection (DBPP) algorithm to effi-
ciently build our projection convolutional neural networks (PCNNs) [77] and obtain highly
accurate yet robust BNNs. Theoretically, we achieve a projection loss by taking advantage
of our DBPP algorithms’ ability to perform discrete optimization on model compression.
The advantages of the projection loss also lie in that it can be jointly learned with the
conventional cross-entropy loss in the same pipeline as backpropagation. The two losses
are simultaneously optimized in continuous and discrete spaces, optimally combined by the
projection approach in a theoretical framework. They can enrich the diversity and thus
improve modeling capacity. As shown in Fig.3.12, we develop a generic projection convolu-
tion layer that can be used in existing convolutional networks. Both the quantized kernels
and the projection are jointly optimized in an end-to-end manner. Our projection matrices are optimized during training but not used for inference, resulting in a compact and efficient learning architecture.
As a general framework, other loss functions (e.g., center loss) can also be used to further
improve the performance of our PCNNs based on a progressive optimization method.
Discrete optimization is one of the hot topics in mathematics and is widely used to solve
computer vision problems [119, 127]. Conventionally, the discrete optimization problem is
solved by searching for an optimal set of discrete values concerning minimizing a loss func-
tion. This chapter proposes a new discrete backpropagation algorithm that uses a projection
function to binarize or quantize the input variables in a unified framework. Due to the flex-
FIGURE 3.12
In PCNNs, a new discrete backpropagation via projection is proposed to build binarized neu-
ral networks in an end-to-end manner. Full-precision convolutional kernels Cil are quantized
l
by projection as Ĉi,j . Due to multiple projections, the diversity is enriched. The resulting
l
kernel tensor Di is used the same as in conventional ones. Both the projection loss Lp and
the traditional loss Ls are used to train PCNNs. We illustrate our network structure Basic
Block Unit based on ResNet, and more specific details are shown in the dotted box (pro-
jection convolution layer). © indicates the concatenation operation on the channels. Note
that inference does not use projection matrices Wjl and full-precision kernels Cil .
ible projection scheme, we obtain diverse binarized models with higher performance than
the previous ones.
3.5.1 Projection
In our work, we define the quantization of the input variable as a projection onto a set $\Omega := \{a_1, a_2, ..., a_U\}$, where each element $a_i$, $i = 1, 2, ..., U$, satisfies the constraint $a_1 < a_2 < ... < a_U$ and is a discrete value of the input variable. Then we define the projection of $x \in \mathbb{R}$ onto $\Omega$ as
$$P_\Omega(\omega, x) = \arg\min_{a_i}\|\omega \circ x - a_i\|, \quad i \in \{1, ..., U\}, \quad (3.29)$$
where ω is a projection matrix and ◦ denotes the Hadamard product. Equation 3.29 indicates
that the projection aims to find the closest discrete value for each continuous value x.
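A minimal sketch of this projection onto a discrete set, under our own assumptions about tensor shapes (each element of ω ◦ x is snapped to its nearest discrete value):

```python
import torch

def project_onto_discrete_set(omega, x, levels):
    """Snap each element of omega * x to its closest discrete value.

    omega:  projection matrix (same shape as x)
    x:      full-precision tensor
    levels: 1-D tensor of discrete values a_1 < ... < a_U
    """
    z = omega * x                                  # Hadamard product
    dist = (z.unsqueeze(-1) - levels).abs()        # distance to every level
    return levels[dist.argmin(dim=-1)]

# Example: binarization corresponds to levels = torch.tensor([-d, d]).
```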
3.5.2 Optimization
Minimizing f (x) based on the discrete optimization or integer programming method, whose
variables are restricted to discrete values, becomes more challenging when training a
large-scale problem on a huge data set [53]. We propose to solve the problem within the
backpropagation framework by considering: 1) the inference process of the optimized model is based on the quantized variables, which means that the variables must be quantized in the forward pass (corresponding to inference) during training, and the loss is calculated based on the quantized variables; 2) the backpropagation does not necessarily need to be quantized, but it must fully consider the relationship between the quantized variables and their counterparts. Based on the above considerations, we propose that in the k-th iteration, based on the projection in Eq. 3.29, $x^{[k]}$ is quantized to $\hat{x}^{[k]}$ in the forward pass, which leads to the constrained problem
$$\min f(\omega, x) \quad \text{s.t.} \ \hat{x}_j^{[k]} = P_{\Omega_j}(\omega_j, x), \quad (3.31)$$
where $\omega_j$, $j \in \{1, ..., J\}$, is the j-th projection matrix and J is the total number of projection matrices. To solve the problem in (3.31), we define our update rule as
$$x \leftarrow x^{[k]} - \eta\delta_{\hat{x}}^{[k]}, \quad (3.32)$$
where the superscript [k+1] is removed from x, $\delta_{\hat{x}}^{[k]}$ is the gradient of $f(\omega, x)$ with respect to $x = \hat{x}$, and η is the learning rate. The quantization process $\hat{x}^{[k]} \leftarrow x^{[k]}$, that is, $P_{\Omega_j}(\omega_j, x^{[k]})$, is equivalent to finding the projection of $\omega_j \circ (x + \eta\delta_{\hat{x}}^{[k]})$ onto Ω as
$$\hat{x}^{[k]} = \arg\min_{\hat{x}}\big\{\|\hat{x} - \omega_j \circ (x + \eta\delta_{\hat{x}}^{[k]})\|^2,\ \hat{x} \in \Omega\big\}. \quad (3.33)$$
Obviously, $\hat{x}^{[k]}$ is the solution to the problem in (3.33). So, by incorporating (3.33) into $f(\omega, x)$, we obtain a new formulation for (3.31) based on the Lagrangian method as
$$\min f(\omega, x) + \frac{\lambda}{2}\sum_{j}^{J}\big\|\hat{x}^{[k]} - \omega_j \circ \big(x + \eta\delta_{\hat{x}}^{[k]}\big)\big\|^2. \quad (3.34)$$
The newly added term on the right of (3.34) is a quadratic function and is referred to as the projection loss.
In this case, we only consider one projection function, so the subscript j of ωj is omitted for
simplicity. For multiple projections, the analysis is given after that. In the forward step, only
the discrete kernel values participate in the calculation, so their gradients can be obtained
by
as ω and x̂ are bilinear with each other as ω ◦ x̂[k] . In our discrete optimization framework,
the discrete values of convolutional kernels are updated according to their gradients. Taking
Eq. 3.36 into consideration, we derive the update rule for x̂[k+1] as
$$x = \frac{1}{J}\sum_{j}^{J}\omega_j^{-1} \circ \hat{x}_j, \quad (3.41)$$
which shows that multiple projections can better reconstruct the full-precision kernels from their binary counterparts.
$$L_p = \frac{\lambda}{2}\sum_{l,i}^{L,I}\sum_{j}^{J}\big\|\hat{C}_{i,j}^{l,[k]} - W_j^{l,[k]} \circ \big(C_i^{l,[k]} + \eta\delta_{\hat{C}_{i,j}}^{l,[k]}\big)\big\|^2, \quad (3.42)$$
where $C_i^{l,[k]}$, $l \in \{1, ..., L\}$, $i \in \{1, ..., I\}$, denotes the i-th kernel tensor of the l-th convolutional layer in the k-th iteration, and $\hat{C}_{i,j}^{l,[k]}$ is the quantized kernel of $C_i^{l,[k]}$ via the projection $P_{\Omega^l_j}$, $j \in \{1, ..., J\}$, as
$$\hat{C}_{i,j}^{l,[k]} = P_{\Omega^l_j}\big(W_j^{l,[k]}, C_i^{l,[k]}\big), \quad (3.43)$$
$$F_h^{l+1} = F_{h,1}^{l+1} \oplus \cdots \oplus F_{h,J}^{l+1}, \quad (3.48)$$
where ⊗ denotes the convolution operation, $F_{h,j}^{l+1}$ is the j-th channel of the h-th feature map at the (l+1)-th convolutional layer, and $F_h^l$ denotes the h-th feature map at the l-th convolutional layer. To be more precise, for example, when h = 1, for the j-th channel of an output feature map, $F_{1,j}^{l+1}$ is the sum of the convolutions between all the h input feature maps and the i corresponding quantized kernels. All channels of the output feature maps are obtained as $F_{h,1}^{l+1}, ..., F_{h,j}^{l+1}, ..., F_{h,J}^{l+1}$, and they are concatenated to construct the h-th output feature map $F_h^{l+1}$.
It should be emphasized that we can utilize multiple projections to increase the diversity
of convolutional kernels $D^l$. However, even a single projection can perform much better than existing BNNs. What is essential is the use of DBPP, which differs from [147], which is based on a single quantization scheme. Within our convolutional scheme, there is no dimension disagreement
on feature maps and kernels in two successive layers. Thus, we can replace the traditional
convolutional layers with ours to binarize widely used networks, such as VGGs and ResNets.
At inference time, we only store the set of quantized kernels Dil instead of the full-precision
ones; that is, projection matrices Wjl are not used for inference, achieving a reduction in
storage.
$$\frac{\partial L_P}{\partial C_i^l} = \lambda\sum_{j}^{J}\Big(W_j^l \circ \big(C_i^l + \eta\delta_{\hat{C}_{i,j}^l}\big) - \hat{C}_{i,j}^l\Big)\circ W_j^l, \quad (3.52)$$
where 1 is the indicator function [199] widely used to estimate the gradient of the nondif-
ferentiable function. More specifically, the output of the indicator function is 1 only if the
condition is satisfied; otherwise, it is 0.
Updating $W_j^l$: Likewise, the gradient of the projection parameter $\delta_{W_j^l}$ consists of the following two parts:
$$\delta_{W_j^l} = \frac{\partial L}{\partial W_j^l} = \frac{\partial L_S}{\partial W_j^l} + \frac{\partial L_P}{\partial W_j^l}, \quad (3.53)$$
where $\eta_2$ is the learning rate for $W_j^l$. We also have the following:
$$\frac{\partial L_S}{\partial W_j^l} = \sum_h\Big[\frac{\partial L_S}{\partial W_j^l}\Big]_h
= \sum_h\Big[\sum_i^{I}\frac{\partial L_S}{\partial \hat{C}_{i,j}^l}\cdot\frac{\partial P_{\Omega^l_j}(W_j^l, C_i^l)}{\partial (W_j^l \circ C_i^l)}\cdot\frac{\partial (W_j^l \circ C_i^l)}{\partial W_j^l}\Big]_h
= \sum_h\Big[\sum_i^{I}\frac{\partial L_S}{\partial \hat{C}_{i,j}^l}\circ \mathbf{1}_{-1\le W_j^l\circ C_i^l\le 1}\circ C_i^l\Big]_h, \quad (3.55)$$
$$\frac{\partial L_P}{\partial W_j^l} = \lambda\sum_h\Big[\sum_i^{I}\Big(W_j^l\circ\big(C_i^l+\eta\delta_{\hat{C}_{i,j}^l}\big)-\hat{C}_{i,j}^l\Big)\circ\big(C_i^l+\eta\delta_{\hat{C}_{i,j}^l}\big)\Big]_h, \quad (3.56)$$
where h indicates the hth plane of the tensor along the channels. It shows that the proposed
algorithm can be trained from end to end, and we summarize the training procedure in
Algorithm 13. In the implementation, we use the mean of W in the forward process but
keep the original W in the backward propagation.
Note that in PCNNs for BNNs, we set U = 2 and a2 = −a1 . Two binarization processes
are used in PCNNs. The first is the kernel binarization, which is done based on the projec-
tion onto ΩN , whose elements are calculated based on the mean absolute values of all full
precision kernels per layer [199] as
$$\frac{1}{I}\sum_{i}^{I}\big\|C_i^l\big\|_1, \quad (3.57)$$
We believe that compressed ternary CNNs such as TTN [299] and TWN [130] have
better initialization states for binary CNNs. Theoretically, the performance of models with
ternary weights is slightly better than those with binary weights and far worse than those
of real-valued ones. Still, they provide an excellent initialization state for 1-bit CNNs in
our proposed progressive optimization framework. Subsequent experiments show that our
PCNNs trained from a progressive optimization strategy perform better than those from
scratch, even better than the ternary PCNNs from scratch.
The discrete set for ternary weights is a special case, defined as $\Omega := \{a_1, a_2, a_3\}$. We further require $a_3 = -a_1 = \Delta$, as in Eq. 3.57, and $a_2 = 0$ to be hardware friendly [130]. Regarding the threshold for ternary weights, we follow the choice made in [229] as
$$\Delta^l = \sigma \times E\big(|C^l|\big) \approx \frac{\sigma}{I}\sum_{i}^{I}\big\|C_i^l\big\|_1, \quad (3.58)$$
where σ is a constant factor for all layers. Note that [229] applies Eq. 3.58 to convolutional inputs or feature maps; we find it appropriate for convolutional weights as well. Consequently, we redefine the projection in Eq. 3.29 as
$$P_\Omega(\omega, x) = \arg\min_{a_i}\|\omega \circ x - 2a_i\|, \quad i \in \{1, ..., U\}. \quad (3.59)$$
In our proposed progressive optimization framework, the PCNNs with ternary weights
(ternary PCNNs) are first trained from scratch and then served as pre-trained models to
progressively fine-tune the PCNNs with binary weights (binary PCNNs).
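The ternary setting of Eqs. (3.57)–(3.58) can be sketched as follows; for simplicity we use Δ both as the threshold and as the nonzero level, and the value σ = 0.7 is only a placeholder, not the book's setting:

```python
import torch

def ternary_levels(weight, sigma=0.7):
    """Layerwise delta of Eq. (3.58) for kernels of shape (I, C, k, k)."""
    delta = sigma * weight.abs().mean()
    return delta, torch.stack([-delta, torch.zeros_like(delta), delta])

def ternarize(weight, delta):
    """Quantize weights to {-delta, 0, +delta}, using delta as threshold."""
    out = torch.zeros_like(weight)
    out[weight > delta] = delta
    out[weight < -delta] = -delta
    return out
```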
FIGURE 3.13
In our proposed progressive optimization framework, the two additional losses, projection
loss, and center loss are simultaneously optimized in continuous and discrete spaces, opti-
mally combined by the projection approach in a theoretical framework. The subfigure on
the left explains the softmax function in the cross-entropy loss. The subfigure in the mid-
dle illustrates the process of progressively turning ternary kernel weights into binary ones
within our projection approach. The subfigure on the right shows the function of center loss
to force the learned feature maps to cluster together, class by class.
$$L_C = \frac{\gamma}{2}\sum_{i=1}^{m}\big\|x_i - c_{y_i}\big\|_2^2, \quad (3.60)$$
where m denotes the total number of samples or batch size, and γ is a hyperparameter to
balance the center loss with other losses. More details on center loss can be found in [245].
By incorporating Eq. 3.60 into Eq. 3.110, the total loss is updated as
$$L = L_S + L_P + L_C. \quad (3.61)$$
We note that the center loss is successfully deployed to handle feature variations in the
training and will be omitted in the inference, so there is no additional memory storage
and computational cost. More intuitive illustrations can be found in Fig. 3.13, and a more
detailed training procedure is described in Algorithm 3.
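A minimal sketch of the center loss in Eq. (3.60), assuming one learnable center per class (the class name and initialization are our own):

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss of Eq. (3.60): one learnable center per class."""

    def __init__(self, num_classes, feat_dim, gamma=0.5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.gamma = gamma

    def forward(self, features, labels):
        # features: (m, feat_dim); labels: (m,) class indices
        diff = features - self.centers[labels]
        return 0.5 * self.gamma * diff.pow(2).sum()
```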
FIGURE 3.14
We visualize the distribution of kernel weights of the first convolution layer of PCNN-22.
The variance increases when λ, which balances the projection loss and the cross-entropy loss, decreases. In particular, when λ = 0 (no projection loss), only one group is obtained,
where the kernel weights are distributed around 0, which could result in instability during
binarization. In contrast, two Gaussians (with projection loss, λ > 0) are more powerful
than the single one (without projection loss), which thus results in better BNNs, as also
validated in Table 3.2.
curves) converge faster than PCNNs with λ = 0 (yellow curves) when the epoch number
> 150.
Diversity visualization: In Fig. 3.17, we visualize four channels of the binary kernels $D_i^l$ in the first row, the feature maps produced by $D_i^l$ in the second row, and the corresponding feature maps after binarization in the third row when J = 4. This illustrates the diversity of kernels and feature maps in PCNNs. Thus, multiple projection functions can capture diverse information and achieve high performance based on compressed models.
FIGURE 3.15
With λ fixed to 1e − 4, the variance of the kernel weights decreases from the 2nd epoch to
the 200th epoch, which confirms that the projection loss does not affect the convergence.
FIGURE 3.16
Training and testing curves of PCNN-22 when λ = 0 and 1e−4, which show that the projection loss has little effect on the convergence.
FIGURE 3.17
Illustration of binary kernels Dil (first row), feature maps produced by Dil (second row),
and corresponding feature maps after binarization (third row) when J=4. This confirms
the diversity in PCNNs.
TABLE 3.2
Accuracy (%) of PCNN-22 and PCNN-40 (based on WRN-22 and WRN-40, respectively) on the CIFAR-10 dataset with different λ.
Model      λ = 1e−3   λ = 1e−4   λ = 1e−5   λ = 0
PCNN-22    91.92      92.79      92.24      91.52
PCNN-40    92.85      93.78      93.65      92.84
Despite the progress made in 1-bit quantization and network pruning, few works have
combined the two in a unified framework to reinforce each other. It is necessary to introduce
pruning techniques into 1-bit CNNs since not all filters and kernels are equally important
or worth quantizing in the same way. One potential solution is to prune the network and
perform a 1-bit quantization over the remaining weights to produce a more compressed
network. However, this solution does not consider the difference between binarized and full
precision parameters during pruning. Therefore, a promising alternative is to prune the
quantized network. However, designing a unified framework to combine quantization and
pruning is still an open question.
To address these issues, we introduce a rectified binary convolutional network
(RBCN) [148] to train a BNN, in which a novel learning architecture is presented in a
GAN framework. Our motivation is based on the fact that GANs can match two data
distributions (the full-precision and 1-bit networks). This can also be viewed as distill-
ing/exploiting the full precision model to benefit its 1-bit counterpart. For training RBCN,
the primary process for binarization is illustrated in Fig. 3.18, where the full-precision model and the 1-bit model (generator) provide “real” and “fake” feature maps to the discriminators.
FIGURE 3.18
This figure shows the framework for integrating the Rectified Binary Convolutional Network
(RBCN) with Generative Adversarial Network (GAN) learning. The full precision model
provides “real” feature maps, while the 1-bit model (as a generator) provides “fake” feature
maps to discriminators trying to distinguish “real” from “fake.” Meanwhile, the generator
tries to make the discriminators work improperly. When this process is repeated, both
the full-precision feature maps and kernels (across all convolutional layers) are sufficiently
employed to enhance the capacity of the 1-bit model. Note that (1) the full precision model
is used only in learning but not in inference; (2) after training, the full precision learned
filters W are discarded, and only the binarized filters Ŵ and the shared learnable matrices
C are kept in RBCN for the calculation of the feature maps in inference.
The discriminators try to distinguish the “real” from the “fake,” and the generator
tries to make the discriminators unable to work well. The result is a rectified process and a
unique architecture with a more precise estimation of the full precision model. Pruning is
also explored to improve the applicability of the 1-bit model in practical applications in the
GAN framework. To accomplish this, we integrate quantization and pruning into a unified
framework.
$$\arg\min_{W,\hat{W},C}\max_{Y}\ L = L_{Adv}(W, \hat{W}, C, Y) + L_S(W, \hat{W}, C) + L_{Kernel}(W, \hat{W}, C), \quad (3.62)$$
where D(·) consists of a series of basic blocks, each containing linear and LeakyRelu layers.
We also have multiple discriminators to rectify the binarization training process.
In addition, LKernel (W, Ŵ , C) denotes the kernel loss between the learned full precision
filters W and the binarized filters Ŵ and is defined as:
where i represents the ith channel and l represents the lth layer. In Eq. 3.65, the objective
is to obtain W , Ŵ and C with Y fixed, which is why the term D(R; Y ) in Eq. 6.79 can
be ignored. The update process for Y is found in Algorithm 13. The advantage of our
formulation in Eq. 3.65 lies in that the loss function is trainable, which means it can be
easily incorporated into existing learning frameworks.
$$\delta_{W_i^l} = \frac{\partial L}{\partial W_i^l} = \frac{\partial L}{\partial \hat{W}_i^l}\,\frac{\partial \hat{W}_i^l}{\partial W_i^l}, \quad (3.67)$$
where
$$\frac{\partial \hat{W}_i^l}{\partial W_i^l} = \begin{cases} 1.2 + 2W_i^l, & -1 \le W_i^l < 0, \\ 2 - 2W_i^l, & 0 \le W_i^l < 1, \\ 10, & \text{otherwise}, \end{cases} \quad (3.68)$$
which is an approximation of 2× the Dirac delta function [159]. Furthermore,
$$\frac{\partial L}{\partial \hat{W}_i^l} = \frac{\partial L_S}{\partial \hat{W}_i^l} + \frac{\partial L_{Kernel}}{\partial \hat{W}_i^l} + \frac{\partial L_{Adv}}{\partial \hat{W}_i^l}, \quad (3.69)$$
and
$$W_i^l \leftarrow W_i^l - \eta_1\delta_{W_i^l}, \quad (3.70)$$
where $\eta_1$ is the learning rate. Then,
$$\frac{\partial L_{Kernel}}{\partial \hat{W}_i^l} = -\lambda_1\big(W_i^l - C^l\hat{W}_i^l\big)C^l, \quad (3.71)$$
$$\frac{\partial L_{Adv}}{\partial \hat{W}_i^l} = -2\big(1 - D(T_i^l; Y)\big)\frac{\partial D}{\partial \hat{W}_i^l}. \quad (3.72)$$
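The binarization operator with the piecewise backward approximation of Eq. (3.68) can be sketched as a custom autograd function; the coefficients below simply reproduce Eq. (3.68) as stated, and the class name is our own:

```python
import torch

class RBCNSign(torch.autograd.Function):
    """Binarize in the forward pass; apply the piecewise derivative of
    Eq. (3.68) in the backward pass."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        scale = torch.full_like(w, 10.0)                       # otherwise
        scale = torch.where((w >= -1) & (w < 0), 1.2 + 2 * w, scale)
        scale = torch.where((w >= 0) & (w < 1), 2 - 2 * w, scale)
        return grad_output * scale
```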
Update C: We further update the learnable matrix C l with W l fixed. Let δC l be the
gradient of C l . Then we have
$$\delta_{C^l} = \frac{\partial L_S}{\partial C^l} + \frac{\partial L_{Kernel}}{\partial C^l} + \frac{\partial L_{Adv}}{\partial C^l}, \quad (3.73)$$
and
$$C^l \leftarrow C^l - \eta_2\delta_{C^l}, \quad (3.74)$$
where $\eta_2$ is another learning rate. Furthermore,
$$\frac{\partial L_{Kernel}}{\partial C^l} = -\lambda_1\sum_i\big(W_i^l - C^l\hat{W}_i^l\big)\hat{W}_i^l, \quad (3.75)$$
$$\frac{\partial L_{Adv}}{\partial C^l} = -\sum_i 2\big(1 - D(T_i^l; Y)\big)\frac{\partial D}{\partial C^l}. \quad (3.76)$$
These derivations show that the rectified process is trainable in an end-to-end manner.
The complete training process is summarized in Algorithm 13, including how to update
the discriminators. As described in line 17 of Algorithm 13, we independently update the other parameters while fixing the convolutional layer’s parameters to enhance the variety of each layer’s feature maps. In this way, we speed up the training convergence and fully explore the potential
of 1-bit networks. In our implementation, all the values of C l are replaced by their average
during the forward process. A scalar, not a matrix, is involved in inference, thus speeding
up computation.
where Lp is the pruning loss function, and the forms of LAdv p (Wp , Ŵp , Cp , Mp , Yp ) and
LKernel p (Wp , Ŵp , Cp ) are
where ◦ is an operator that obtains the pruned weight with mask Mp . The other part of
the forward propagation in the pruned RBCNs is the same as in the RBCNs.
In pruned RBCNs, what needs to be learned and updated are full precision filters Wp ,
learnable matrices Cp , and soft mask Mp . In each convolutional layer, these three sets of
parameters are jointly learned.
Update $M_p$. $M_p$ is updated by FISTA [141] with the initialization $\alpha_{(1)} = 1$. Then we obtain the following:
$$\alpha_{(k+1)} = \frac{1}{2}\Big(1 + \sqrt{1 + 4\alpha_{(k)}^2}\Big), \quad (3.84)$$
$$y_{(k+1)} = M_{p,(k)} + \frac{\alpha_{(k)} - 1}{\alpha_{(k+1)}}\big(M_{p,(k)} - M_{p,(k-1)}\big), \quad (3.85)$$
$$M_{p,(k+1)} = \mathrm{prox}_{\eta_{(k+1)}\lambda\|\cdot\|_1}\Big(y_{(k+1)} - \eta_{(k+1)}\frac{\partial(L_{Adv\_p} + L_{Data\_p})}{\partial y_{(k+1)}}\Big), \quad (3.86)$$
where $\eta_{(k+1)}$ is the learning rate at iteration k+1 and $\mathrm{prox}_{\eta_{(k+1)}\lambda\|\cdot\|_1}(z_i) = \mathrm{sign}(z_i)\cdot(|z_i| - \eta_{(k+1)}\lambda)_+$; more details can be found in [142].
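One FISTA step of Eqs. (3.84)–(3.86) can be sketched as follows, with the ℓ1 proximal operator realized as soft-thresholding; the function signature is our own:

```python
import torch

def fista_step(m_curr, m_prev, grad_fn, alpha, lr, lam):
    """One FISTA update of the soft mask M_p (Eqs. 3.84-3.86).

    grad_fn(y) returns d(L_Adv_p + L_Data_p)/dy at the extrapolated point;
    lam is the l1 regularization weight.
    """
    alpha_next = 0.5 * (1 + (1 + 4 * alpha ** 2) ** 0.5)        # Eq. (3.84)
    y = m_curr + (alpha - 1) / alpha_next * (m_curr - m_prev)   # Eq. (3.85)
    z = y - lr * grad_fn(y)                                     # gradient step
    # Proximal operator of the l1 norm: soft-thresholding, Eq. (3.86).
    m_next = torch.sign(z) * torch.clamp(z.abs() - lr * lam, min=0.0)
    return m_next, alpha_next
```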
Update $W_p$. Let $\delta_{W_{p,i}^l}$ be the gradient of the full-precision filter $W_{p,i}^l$. During backpropagation, the gradients pass to $\hat{W}_{p,i}^l$ first and then to $W_{p,i}^l$. Furthermore,

and
$$W_{p,i}^l \leftarrow W_{p,i}^l - \eta_{p,1}\delta_{W_{p,i}^l}, \quad (3.88)$$
where $\eta_{p,1}$ is the learning rate, and $\frac{\partial L_{Kernel\_p}}{\partial \hat{W}_{p,i}^l}$ and $\frac{\partial L_{Adv\_p}}{\partial \hat{W}_{p,i}^l}$ are
$$\frac{\partial L_{Kernel\_p}}{\partial \hat{W}_{p,i}^l} = -\lambda_1\big(W_{p,i}^l - C_p^l\hat{W}_{p,i}^l\big)C_p^l, \quad (3.89)$$
$$\frac{\partial L_{Adv\_p}}{\partial \hat{W}_{p,i}^l} = -2\big(1 - D(T_{p,i}^l; Y_p)\big)\frac{\partial D_p}{\partial \hat{W}_{p,i}^l}. \quad (3.90)$$
And
$$\frac{\partial L_{Data\_p}}{\partial \hat{W}_{p,i}^l} = -\frac{1}{n}(R_p - T_p)\frac{\partial T_p}{\partial \hat{W}_{p,i}^l}, \quad (3.91)$$
Update $C_p$. We further update the learnable matrix $C_p^l$ with $W_p^l$ and $M_p^l$ fixed. Let $\delta_{C_p^l}$ be the gradient of $C_p^l$. Then we have

and
$$C_p^l \leftarrow C_p^l - \eta_{p,2}\delta_{C_p^l}. \quad (3.93)$$
$\frac{\partial L_{Kernel\_p}}{\partial C_p^l}$ and $\frac{\partial L_{Adv\_p}}{\partial C_p^l}$ are
$$\frac{\partial L_{Kernel\_p}}{\partial C_p^l} = -\lambda_1\sum_i\big(W_{p,i}^l - C_p^l\hat{W}_{p,i}^l\big)\hat{W}_{p,i}^l, \quad (3.94)$$
$$\frac{\partial L_{Adv\_p}}{\partial C_p^l} = -\sum_i 2\big(1 - D_p(T_{p,i}^l; Y_p)\big)\frac{\partial D_p}{\partial C_p^l}. \quad (3.95)$$
Furthermore,
$$\frac{\partial L_{Data\_p}}{\partial C_p^l} = \frac{1}{n}\sum_i(R_p - T_p)\frac{\partial T_p}{\partial C_p^l}. \quad (3.96)$$
The complete training process is summarized in Algorithm 4, including the update of the
discriminators.
3) We further improve RBCNs by updating the BN layers with W and C fixed after
each epoch (line 17 in Algorithm 13). This further increases our accuracy by 2.51% (61.64%
vs. 59.13%) in CIFAR100 with 32-32-64-128.
FIGURE 3.19
The evolution of the prior p(x), the distribution of the observation y, and the posterior
p(x|y) during learning, where x is the latent variable representing the full-precision param-
eters and y is the quantization error. Initially, the parameters x are initialized according
to a single-mode Gaussian distribution. When our learning algorithm converges, the ideal
case is that (i) p(y) becomes a Gaussian distribution N (0, ν), which corresponds to the
minimum reconstruction error, and (ii) p(x|y) = p(x) is a Gaussian mixture distribution
with two modes where the binarized values x̂ and −x̂ are located.
cant advantages in solving probabilistic graphical models. It can help achieve information
exchange between the perception task and the inference task, conditional dependencies on
high-dimensional data, and effective uncertainty modeling. [14, 124] have been extensively
studied in Bayesian neural networks (BayesNNs). More recent developments establishing
the efficacy of BayesNNs can be found in [215, 139] and the references therein. Estimating
the posterior distribution is a vital part of Bayesian inference and represents the information
on the uncertainties for both the data and the parameters. However, an exact analytical so-
lution for the posterior distribution is intractable, as the number of parameters is large and
the functional form of a neural network does not lend itself to exact integration [16]. Several
approaches have been proposed for solving posterior distributions of weights of BayesNNs,
based on optimization-based techniques such as variational inference (VI) and sampling-
based approaches, such as Markov Chain Monte Carlo (MCMC). MCMC techniques are
typically used to obtain sampling-based estimates of the posterior distribution. BayesNNs
with MCMC have not seen widespread adoption due to the computational cost of time and
storage on a large dataset [120].
In contrast to MCMC, VI tends to converge faster and has been applied to many popular
Bayesian models, such as factorial and topic models [15]. The basic idea of VI is that it
first defines a family of variational distributions and then minimizes the Kullback-Leibler
(KL) divergence concerning the variational family. Many recent works have discussed the
application of variational inference to BayesNNs, e.g., [16, 216].
Despite the progress made in 1-bit quantization or network pruning, little work has combined quantization and pruning in a unified framework to reinforce each other. However, it is necessary
to introduce pruning techniques into 1-bit CNNs. Not all filters and kernels are equally es-
sential and worth quantizing in the same way, as validated subsequently in our experiments.
One potential solution is to prune the network first and then perform a 1-bit quantization
on the remaining network to have a more compressed network. However, such a solution
does not consider the difference between the binarized and full-precision parameters during
pruning. Instinctively, 1-bit CNNs tend to be easily pruned, as CNNs are more redundant
before and after binarization [150]. Thus, one promising alternative is to conduct pruning
over BNNs. However, it remains an open problem to design a unified framework to calcu-
late a 1-bit network first and then prune it. In particular, due to the deterioration of the
representation ability in 1-bit networks, the backpropagation process can be susceptible to
parameter updates, making existing optimization schemes [77] fail.
To address this problem, we use Bayesian learning, a well-established global optimization
scheme [174],[16], to prune 1-bit CNNs. First, Bayesian learning binarizes the full-precision
kernels to two quantization values (centers) to obtain 1-bit CNNs. The quantization error
is minimized when the full-precision kernels follow a Gaussian mixture model, with each
Gaussian centered on its corresponding quantization value. Given two centers for 1-bit
CNNs, two Gaussians that form the mixture model are used to model the full-precision
kernels. Subsequently, the Bayesian learning framework establishes a new pruning operation
to prune 1-bit CNNs. In particular, we divide the filters into two groups, assuming that
those in one group follow the same Gaussian distribution. Then, their average is used to
replace the weights of the filters in this group. Figure 3.20 illustrates the general framework
where three innovative elements are introduced to the learning procedure of 1-bit CNNs
with compression: (1) minimizing the reconstruction error of the parameters before and
after quantization, (2) modeling the parameter distribution as a Gaussian mixture with
two modes centered on the binarized values, and (3) pruning the quantized network by
maximizing a posterior probability. Further analysis led to our three new losses and the
corresponding learning algorithms, referred to as Bayesian kernel loss, Bayesian feature loss,
and Bayesian pruning loss. These three losses can be jointly applied with the conventional
cross-entropy loss within the same back-propagation pipeline. The advantages of Bayesian
FIGURE 3.20
By considering the prior distributions of the kernels and features in the Bayesian frame-
work, we achieve three new Bayesian losses to optimize the 1-bit CNNs. The Bayesian kernel
loss improves the layerwise kernel distribution of each convolution layer, the Bayesian fea-
ture loss introduces the intraclass compactness to alleviate the disturbance induced by the
quantization process, and the Bayesian pruning loss centralizes channels following the same
Gaussian distribution for pruning. The Bayesian feature loss is applied only to the fully
connected layer.
learning are intrinsically inherited during model quantization and pruning. The proposed
losses can also comprehensively supervise the 1-bit CNN training process concerning kernel
and feature distributions. Finally, a new direction on 1-bit CNN pruning is explored further
to improve the compressed model’s applicability in practical applications.
where $y = \hat{x} - w \circ x$ is the reconstruction error, assumed to obey a Gaussian prior with zero mean and
variance $\nu$. Under the most probable $y$ (corresponding to $y = 0$ and $x = w^{-1} \circ \hat{x}$, i.e., the
minimum reconstruction error), we maximize $p(x|y)$ to optimize $x$ for quantization (e.g.,
1-bit CNNs) as:
max p(x|y), (3.98)
which can be solved based on Bayesian learning that uses Bayes’ theorem to determine the
conditional probability of a hypothesis given limited observations. We note that the calcu-
lation of BNNs is still based on optimizing x, as shown in Fig. 3.19, where the binarization
is performed based on the sign function. Equation 3.98 is complicated and difficult to solve
due to the unknown $w^{-1}$, as shown in Eq. 3.97. From a Bayesian learning perspective, we
resolve this problem via maximum a posteriori (MAP) estimation:
$$\max p(x|y) = \max p(y|x)p(x) = \min \|\hat{x} - w \circ x\|_2^2 - 2\nu\log p(x), \qquad (3.99)$$
where
$$p(y|x) \propto \exp\Big(-\frac{1}{2\nu}\|y\|_2^2\Big) \propto \exp\Big(-\frac{1}{2\nu}\|\hat{x} - w \circ x\|_2^2\Big). \qquad (3.100)$$
In Eq. 3.100, we assume that all components of the quantization error y are i.i.d., thus
resulting in a simplified form. As shown in Fig. 3.19, for 1-bit CNNs, x is usually quantized
to two numbers with the same absolute value. We neglect the overlap between the two
numbers, and thus p(x) is modeled as a Gaussian mixture with two modes:
$$p(x) = \frac{1}{2}(2\pi)^{-\frac{N}{2}}\det(\Psi)^{-\frac{1}{2}}\left[\exp\left(-\frac{(x-\mu)^T\Psi^{-1}(x-\mu)}{2}\right) + \exp\left(-\frac{(x+\mu)^T\Psi^{-1}(x+\mu)}{2}\right)\right]$$
$$\approx \frac{1}{2}(2\pi)^{-\frac{N}{2}}\det(\Psi)^{-\frac{1}{2}}\left[\exp\left(-\frac{(x_+-\mu_+)^T\Psi_+^{-1}(x_+-\mu_+)}{2}\right) + \exp\left(-\frac{(x_-+\mu_-)^T\Psi_-^{-1}(x_-+\mu_-)}{2}\right)\right], \qquad (3.101)$$
where x is divided into x+ and x− according to the signs of the elements in x, and N is
the dimension of x. Accordingly, Eq. 3.99 can be rewritten as:
$$\min \|\hat{x} - w\circ x\|_2^2 + \nu(x_+-\mu_+)^T\Psi_+^{-1}(x_+-\mu_+) + \nu(x_-+\mu_-)^T\Psi_-^{-1}(x_-+\mu_-) + \nu\log\det(\Psi), \qquad (3.102)$$
where $\mu_-$ and $\mu_+$ are solved independently, and $\det(\Psi)$ is accordingly set to be the determinant
of the matrix $\Psi_-$ or $\Psi_+$. We call Eq. 3.102 the Bayesian kernel loss.
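To make Eq. 3.102 concrete, the sketch below computes a per-layer version of the Bayesian kernel loss in PyTorch, under the simplifying assumptions adopted later for BONNs (a diagonal Ψ with one shared variance σ² and a scalar mean μ whose negative mode is −μ). The function and variable names are our own illustration, not the authors' released implementation.

```python
import torch

def bayesian_kernel_loss(k, w, mu, sigma, nu=1e-4):
    """Sketch of the Bayesian kernel loss (Eq. 3.102) for one layer.

    k:     real-valued kernels, shape (num_kernels, kernel_size)
    w:     modulation vector, broadcastable against k
    mu:    scalar mean of the positive mode (the negative mode uses -mu)
    sigma: scalar std shared by all kernel elements (diagonal Psi)
    nu:    variance of the reconstruction-error prior
    """
    sigma = torch.as_tensor(sigma, dtype=k.dtype)
    k_hat = torch.sign(k)                    # 1-bit kernels
    recon = ((k_hat - w * k) ** 2).sum()     # ||x_hat - w o x||_2^2

    pos, neg = k[k >= 0], k[k < 0]           # split x into x_+ and x_- (Eq. 3.101)
    prior = ((pos - mu) ** 2).sum() / sigma ** 2 \
          + ((neg + mu) ** 2).sum() / sigma ** 2 \
          + k.numel() * torch.log(sigma ** 2)   # log det(Psi) in the diagonal case
    return recon + nu * prior
```

In a full training loop, `mu` and `sigma` would be learned jointly with the kernels; their gradients correspond to Eqs. 3.118 and 3.119 later in this section.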
Bayesian feature loss: We also design a Bayesian feature loss to alleviate the disturbance
caused by the extreme quantization process in 1-bit CNNs. Considering the intraclass com-
pactness, the features $f_m$ of the $m$-th class supposedly follow a Gaussian distribution with
mean $c_m$, as revealed in the center loss [245]. Similarly to the Bayesian kernel loss, we
define $y_{f_m} = f_m - c_m$ and $y_{f_m}\sim\mathcal{N}(0,\sigma_m)$, and we have:
$$\min \|f_m - c_m\|_2^2 + \sum_{n=1}^{N_f}\big[\sigma_{m,n}^{-2}(f_{m,n}-c_{m,n})^2 + \log(\sigma_{m,n}^2)\big], \qquad (3.103)$$
which is called the Bayesian feature loss. In Eq. 3.103, $\sigma_{m,n}$, $f_{m,n}$, and $c_{m,n}$ are the $n$-th
elements of $\sigma_m$, $f_m$, and $c_m$, respectively. We take the latent distributions of kernel weights
and features into consideration in the same framework and introduce Bayesian losses to
improve the capacity of 1-bit CNNs.
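A minimal sketch of the Bayesian feature loss in Eq. 3.103 is given below, assuming per-class centers and per-dimension standard deviations stored as learnable tensors; the names are illustrative only and not taken from the original code.

```python
import torch

def bayesian_feature_loss(features, labels, centers, sigma, eps=1e-8):
    """Sketch of the Bayesian feature loss (Eq. 3.103).

    features: (batch, feat_dim) outputs of the fully connected layer
    labels:   (batch,) class indices
    centers:  (num_classes, feat_dim) learnable class centers c_m
    sigma:    (num_classes, feat_dim) learnable per-dimension std sigma_m
    """
    c = centers[labels]                       # c_m for each sample
    s2 = sigma[labels] ** 2 + eps             # sigma_{m,n}^2
    diff2 = (features - c) ** 2
    # ||f_m - c_m||^2 + sum_n [ sigma^-2 (f - c)^2 + log sigma^2 ]
    return (diff2.sum(dim=1) + (diff2 / s2 + torch.log(s2)).sum(dim=1)).mean()
```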
respectively, and $H^l$ and $W^l$ are the height and width of the kernels, respectively. For
clarity, we define
$$K^l = [K_1^l, K_2^l, \ldots, K_{C_o^l}^l], \qquad (3.104)$$
where $K_i^l$, $i = 1, 2, \ldots, C_o^l$, is a 3-dimensional filter $\in \mathbb{R}^{C_i^l\times H^l\times W^l}$. For simplicity, $l$ is omitted
from the remainder of this section. To prune 1-bit CNNs, we assimilate similar filters into
the same one through a controlled learning process. To do this, we first divide $K$ into
different groups using the K-means algorithm and then replace the filters of each group by
their average during optimization. This process assumes that the $K_i$ in the same group follow
the same Gaussian distribution during training. The pruning problem then becomes how
to find the average $\bar{K}$ that replaces all the $K_i$'s following the same distribution, which leads to a
problem similar to Eq. 3.99. It should be noted that learning with a Gaussian
distribution constraint is widely considered in [82].
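The grouping step described above can be sketched as follows; we use scikit-learn's KMeans to cluster the flattened filters of one layer and compute the per-group mean filters. The number of groups and the use of scikit-learn are illustrative choices, not the authors' implementation.

```python
import torch
from sklearn.cluster import KMeans

def group_and_average_filters(K, num_groups):
    """Cluster the filters K (shape: C_o x C_i x H x W) into `num_groups`
    groups and return the group assignments and the per-group mean filters."""
    flat = K.detach().reshape(K.shape[0], -1).cpu().numpy()
    labels = KMeans(n_clusters=num_groups, n_init=10).fit_predict(flat)
    labels = torch.as_tensor(labels, device=K.device)
    means = torch.stack([K[labels == j].mean(dim=0) for j in range(num_groups)])
    return labels, means

# During pruning, each filter is pulled toward its group mean,
# K_ij <- (1 - gamma) * K_ij + gamma * mean_j, as described later in this section.
```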
Accordingly, Bayesian learning is used to prune 1-bit CNNs. We denote by $\epsilon$ the difference
between a filter and its mean, i.e., $\epsilon = K - \bar{K}$, and assume for simplicity that $\epsilon$ follows a Gaussian
distribution. To calculate $\bar{K}$, we minimize $\epsilon$ based on MAP in our Bayesian framework, and
we have
$$\bar{K} = \arg\max_{\bar{K}} p(\bar{K}|\epsilon) = \arg\max_{\bar{K}} p(\epsilon|\bar{K})p(\bar{K}), \qquad (3.105)$$
$$p(\epsilon|\bar{K}) \propto \exp\Big(-\frac{1}{2\nu}\|\epsilon\|_2^2\Big) \propto \exp\Big(-\frac{1}{2\nu}\|K - \bar{K}\|_2^2\Big), \qquad (3.106)$$
and $p(\bar{K})$ is similar to Eq. 3.101 but with one mode. Thus, we have
$$\min \|K - \bar{K}\|_2^2 + \nu(K - \bar{K})^T\Psi^{-1}(K - \bar{K}) + \nu\log\det(\Psi),$$
which is called the Bayesian pruning loss. In summary, our Bayesian pruning solves the
problem more generally, assuming that similar kernels follow a Gaussian distribution and
will finally be represented by their centers for pruning. From this viewpoint, we can obtain
a more general pruning method, which is more suitable for binary neural networks than
the existing ones. Moreover, we take the latent distributions of kernel weights, features, and
filters into consideration in the same framework and introduce Bayesian losses and Bayesian
pruning to improve the capacity of 1-bit CNNs. Comparative experimental results on model
pruning also demonstrate the superiority of our BONNs [287] over existing pruning methods.
3.7.4 BONNs
We employ the three Bayesian losses to optimize 1-bit CNNs, which form our Bayesian
Optimized 1-bit CNNs (BONNs). To do this, we reformulate the first two Bayesian losses
where $k_n^{l,i}$, $l\in\{1,\ldots,L\}$, $i\in\{1,\ldots,C_o^l\}$, $n\in\{1,\ldots,C_i^l\}$, is the vectorization of the $i$-th kernel
matrix at the $l$-th convolutional layer, $w^l$ is a vector used to modulate $k_n^{l,i}$, and $\mu_i^l$ and $\Psi_i^l$
are the mean and covariance of the $i$-th kernel vector at the $l$-th layer, respectively. We term
$L_B$ the Bayesian optimization loss. Furthermore, we assume that the parameters
in the same kernel are independent. Thus $\Psi_i^l$ becomes a diagonal matrix with the identical
value $(\sigma_i^l)^2$, where $(\sigma_i^l)^2$ is the variance of the $i$-th kernel of the $l$-th layer. In this case, the
calculation of the inverse of $\Psi_i^l$ is sped up, and all the elements of $\mu_i^l$ are identical and equal
to the scalar $\mu_i^l$. Note that in our implementation, all elements of $w^l$ are replaced by their average
during the forward process. Accordingly, only a scalar instead of a matrix is involved in the
inference, and thus the computation is significantly accelerated.
After training the 1-bit CNNs, the Bayesian pruning loss $L_P$ is then used for the optimization
of feature channels, which can be written as:
$$L_P = \sum_{l=1}^{L}\sum_{j=1}^{J_l}\sum_{i=1}^{I_j^l}\Big[\|K_{i,j}^l - \bar{K}_j^l\|_2^2 + \nu(K_{i,j}^l - \bar{K}_j^l)^T(\Psi_j^l)^{-1}(K_{i,j}^l - \bar{K}_j^l) + \nu\log\det(\Psi_j^l)\Big], \qquad (3.109)$$
l
where Jl is the number of Gaussian clusters (groups) of the l-th layer, and Ki,j , i =
l
1, 2, ..., Ij , are those Ki ’s that belong to the j-th group. In our implementation, we define
Jl = int(Col × ), where is a predefined pruning rate. In this chapter, we use one for all
l l l
layers. Note that when the j-th Gaussian just has one sample Ki,j ,K j = Ki,j and Ψj is a
unit matrix.
In BONNs, the cross-entropy loss $L_S$, the Bayesian optimization loss $L_B$, and the
Bayesian pruning loss $L_P$ are aggregated to build the total loss:
$$L = L_S + L_B + \zeta L_P, \qquad (3.110)$$
where $\zeta$ is 0 during binarization training and becomes 1 during pruning. The Bayesian kernel loss
constrains the distribution of the convolution kernels to a symmetric Gaussian mixture with
two modes and simultaneously minimizes the quantization error through the $\|\hat{k}_n^{l,i} - w^l\circ k_n^{l,i}\|_2^2$
term. Meanwhile, the Bayesian feature loss modifies the distribution of the features to reduce
intraclass variation for better classification. The Bayesian pruning loss drives similar kernels
toward their means and thus compresses the 1-bit CNNs further.
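As a schematic of Eq. 3.110, the total loss can be assembled as below, with ζ switched on only in the pruning phase; `L_S`, `L_B`, and `L_P` stand for the cross-entropy, Bayesian optimization, and Bayesian pruning losses computed elsewhere in the pipeline.

```python
def total_loss(L_S, L_B, L_P, pruning_phase):
    """Combine the losses as in Eq. 3.110: L = L_S + L_B + zeta * L_P,
    where zeta is 0 during binarization training and 1 during pruning."""
    zeta = 1.0 if pruning_phase else 0.0
    return L_S + L_B + zeta * L_P
```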
where $w$ denotes a learned vector used to reconstruct the full-precision vector and is shared within a
layer. As mentioned in Section 3.2, during forward propagation, $w^l$ becomes a scalar $\bar{w}^l$ in
each layer, where $\bar{w}^l$ is the mean of $w^l$ and is calculated online. The convolution process is
represented as
$$O^{l+1} = \big((\bar{w}^l)^{-1}\hat{K}^l\big) * \hat{O}^l = (\bar{w}^l)^{-1}\big(\hat{K}^l * \hat{O}^l\big), \qquad (3.111)$$
where $\hat{O}^l$ denotes the binarized feature map of the $l$-th layer, and $O^{l+1}$ is the feature map
of the $(l+1)$-th layer. As Eq. 3.111 depicts, the actual convolution is still binary, and
$O^{l+1}$ is obtained by simply multiplying $(\bar{w}^l)^{-1}$ with the binarized convolution. For each
layer, only one floating-point multiplication is added, which is negligible for BONNs.
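The forward pass of Eq. 3.111 can be sketched as follows: weights and features are binarized with the sign function, a standard convolution is run on the binary tensors (standing in for the XNOR/bit-count kernel), and the output is rescaled by the single scalar per layer. This is an illustrative sketch, not the optimized BONN implementation.

```python
import torch
import torch.nn.functional as F

def bonn_binary_conv(feat, kernels, w_vec, stride=1, padding=1):
    """Sketch of Eq. 3.111: O^{l+1} = (w_bar)^{-1} (K_hat * O_hat).

    feat:    real-valued input feature map (N, C_in, H, W)
    kernels: real-valued kernels (C_out, C_in, kH, kW)
    w_vec:   layer-wise modulation vector; only its mean is used at inference
    """
    feat_bin = torch.sign(feat)          # binarized feature map O_hat
    k_bin = torch.sign(kernels)          # binarized kernels K_hat
    w_bar = w_vec.mean()                 # scalar replacing w^l in the forward pass
    out = F.conv2d(feat_bin, k_bin, stride=stride, padding=padding)
    return out / w_bar                   # one floating-point multiply per layer
```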
In addition, we consider the Gaussian distribution in the forward process of Bayesian
pruning, which updates every filter in one group based on its mean. Specifically, we replace
each filter by $K_{i,j}^l \leftarrow (1-\gamma)K_{i,j}^l + \gamma\bar{K}_j^l$ during pruning.
During backpropagation, the gradient of the kernel $k_n^{l,i}$ is
$$\delta_{k_n^{l,i}} = \frac{\partial L}{\partial k_n^{l,i}} = \frac{\partial L_S}{\partial k_n^{l,i}} + \frac{\partial L_B}{\partial k_n^{l,i}}. \qquad (3.112)$$
where $\mathbf{1}$ is the indicator function that is widely used to estimate the gradient of nondiffer-
entiable parameters [199], and $(\sigma_i^l)^{-2}$ is a vector whose elements are all equal to $(\sigma_i^l)^{-2}$.
Updating $w^l$: Unlike the forward process, $w$ is used in backpropagation to calculate the
gradients. This process is similar to the way $\hat{x}$ is calculated from $x$ asynchronously. Specifi-
cally, $\delta_{w^l}$ is composed of the following two parts:
$$\delta_{w^l} = \frac{\partial L}{\partial w^l} = \frac{\partial L_S}{\partial w^l} + \frac{\partial L_B}{\partial w^l}. \qquad (3.115)$$
For each term in Eq. 3.115, we have:
$$\frac{\partial L_S}{\partial w^l} = \sum_{i=1}^{C_o^l}\sum_{n=1}^{C_i^l}\frac{\partial L_S}{\partial \hat{k}_n^{l,i}}\frac{\partial \hat{k}_n^{l,i}}{\partial (w^l\circ k_n^{l,i})}\frac{\partial (w^l\circ k_n^{l,i})}{\partial w^l} = \sum_{i=1}^{C_o^l}\sum_{n=1}^{C_i^l}\frac{\partial L_S}{\partial \hat{k}_n^{l,i}}\circ \mathbf{1}_{-1\le w^l\circ k_n^{l,i}\le 1}\circ k_n^{l,i}, \qquad (3.116)$$
$$\frac{\partial L_B}{\partial w^l} = \lambda\sum_{i=1}^{C_o^l}\sum_{n=1}^{C_i^l}\big(w^l\circ k_n^{l,i} - \hat{k}_n^{l,i}\big)\circ k_n^{l,i}. \qquad (3.117)$$
Updating $\mu_i^l$ and $\sigma_i^l$: Note that we use the same $\mu_i^l$ and $\sigma_i^l$ for each kernel (see Section
3.2), so the gradients here are scalars. The gradients $\delta_{\mu_i^l}$ and $\delta_{\sigma_i^l}$ are calculated as:
$$\delta_{\mu_i^l} = \frac{\partial L}{\partial \mu_i^l} = \frac{\partial L_B}{\partial \mu_i^l} = \frac{\lambda\nu}{C_i^l\times H^l\times W^l}\sum_{n=1}^{C_i^l}\sum_{p=1}^{H^l\times W^l}\begin{cases}(\sigma_i^l)^{-2}(\mu_i^l - k_{n,p}^{l,i}), & k_{n,p}^{l,i}\ge 0,\\ (\sigma_i^l)^{-2}(\mu_i^l + k_{n,p}^{l,i}), & k_{n,p}^{l,i} < 0,\end{cases} \qquad (3.118)$$
$$\delta_{\sigma_i^l} = \frac{\partial L}{\partial \sigma_i^l} = \frac{\partial L_B}{\partial \sigma_i^l} = \frac{\lambda\nu}{C_i^l\times H^l\times W^l}\sum_{n=1}^{C_i^l}\sum_{p=1}^{H^l\times W^l}\begin{cases}-(\sigma_i^l)^{-3}(k_{n,p}^{l,i}-\mu_i^l)^2 + (\sigma_i^l)^{-1}, & k_{n,p}^{l,i}\ge 0,\\ -(\sigma_i^l)^{-3}(k_{n,p}^{l,i}+\mu_i^l)^2 + (\sigma_i^l)^{-1}, & k_{n,p}^{l,i} < 0,\end{cases} \qquad (3.119)$$
where $k_{n,p}^{l,i}$, $p\in\{1,\ldots,H^l\times W^l\}$, denotes the $p$-th element of $k_n^{l,i}$. In the fine-tuning process,
we update $c_m$ using the same strategy as the center loss [245]. The update of $\sigma_{m,n}$ based on
$L_B$ is straightforward and is not elaborated here for brevity.
Updating $K_{i,j}^l$: In pruning, we aim to gradually converge the filters to their means, so
we replace each filter $K_{i,j}^l$ with its corresponding mean $\bar{K}_j^l$. The gradient is
$$\frac{\partial L}{\partial K_{i,j}^l} = \frac{\partial L_S}{\partial K_{i,j}^l} + \frac{\partial L_B}{\partial K_{i,j}^l} + \frac{\partial L_P}{\partial K_{i,j}^l} = \frac{\partial L_S}{\partial \bar{K}_j^l}\frac{\partial \bar{K}_j^l}{\partial K_{i,j}^l} + \frac{\partial L_B}{\partial \bar{K}_j^l}\frac{\partial \bar{K}_j^l}{\partial K_{i,j}^l} + \frac{\partial L_P}{\partial K_{i,j}^l} = \frac{1}{I_j^l}\Big(\frac{\partial L_S}{\partial \bar{K}_j^l} + \frac{\partial L_B}{\partial \bar{K}_j^l}\Big) + 2\big(K_{i,j}^l - \bar{K}_j^l\big). \qquad (3.120)$$
TABLE 3.5
Effect of the Bayesian losses on the ImageNet data set. The backbone is ResNet-18.

| Bayesian kernel loss  |      | ✓    |      | ✓    |
| Bayesian feature loss |      |      | ✓    | ✓    |
| Accuracy Top-1 (%)    | 56.3 | 58.3 | 58.4 | 59.3 |
| Accuracy Top-5 (%)    | 79.8 | 80.8 | 80.8 | 81.6 |
FIGURE 3.21
The images on the left are input images chosen from the ImageNet ILSVRC12 dataset.
The images on the right are feature maps and binary feature maps from different layers of BONNs.
The first and third rows are feature maps for each group, while the second and fourth rows
are the corresponding binary feature maps. Although binarization of the feature maps causes
information loss, BONNs can still extract the essential features needed for accurate classification.
Weight Distribution Figure 3.23 further illustrates the distribution of the kernel weights,
with λ fixed to 1e − 4. During the training process, the distribution gradually approaches
the two-mode GMM, as assumed previously, confirming the effectiveness of the Bayesian
kernel loss in a more intuitive way. We also compare the kernel weight distribution between
XNOR-Net and BONN. As shown in Fig. 3.24, the kernel weights learned in XNOR-Net
are tightly distributed around the threshold value, but those in BONN are regularized in a
FIGURE 3.22
Training and test accuracies on ImageNet with λ = 1e−4, showing the superiority of the
proposed BONN over XNOR-Net. The backbone of both networks is ResNet-18.
FIGURE 3.23
We demonstrate the kernel weight distribution of the first binarized convolution layer of
BONNs. Before training, we initialize the kernels as a single-mode Gaussian distribution.
From the 2nd epoch to the 200th epoch, with λ fixed to 1e−4, the distribution of the
kernel weights becomes more and more compact with two modes, which confirms that the
Bayesian kernel loss can regularize the kernels into a promising distribution for binarization.
two-mode GMM style. Figure 3.25 shows the evolution of the binarized values during the
training process of XNOR-Net and BONN. The two different patterns indicate that the
binarized values learned in BONN are more diverse.
Effectiveness of Bayesian Feature Loss on Real-Valued Models: We apply our
Bayesian feature loss to real-valued models, including ResNet-18 and ResNet-50 [84]. We
retrain these two backbones with our Bayesian feature loss for 70 epochs. We set the hy-
perparameter θ to 1e − 3. The SGD optimizer has an initial learning rate set to 0.1. We use
FIGURE 3.24
The weight distributions of XNOR and BONN are based on WRN-22 (2nd, 8th, and 14th
convolutional layers) after 200 epochs. The weight distribution difference between XNOR
and BONN indicates that the kernels are regularized across the convolutional layers with
the proposed Bayesian kernel loss.
FIGURE 3.25
Evolution of the binarized values, |x|s, during the XNOR and BONN training process. They
are both based on WRN-22 (2nd, 3rd, 8th, and 14th convolutional layers), and the curves
do not share the same y-axis. The binarized values of XNOR-Net tend to converge to small
and similar values, but those of BONN are learned more diversely.
a learning rate schedule that decreases the learning rate to 10% of its value every 30 epochs. As shown in Table 3.6, our
Bayesian feature loss can further boost the performance of real-valued models by a clear
margin. Specifically, our method improves the Top-1 accuracy of ResNet-18 and ResNet-50 by
0.6% and 0.4%, respectively.
TABLE 3.6
Effect of the Bayesian feature loss on the ImageNet data set. The backbones are real-valued ResNet-18 and ResNet-50.

| Model                 | ResNet-18 |      | ResNet-50 |      |
| Bayesian feature loss |           | ✓    |           | ✓    |
| Accuracy Top-1 (%)    | 69.3      | 69.9 | 76.6      | 77.0 |
| Accuracy Top-5 (%)    | 89.2      | 89.8 | 92.4      | 92.7 |
FIGURE 3.26
An illustration of the RBONN framework. Conventional gradient-based algorithms assume
that the hidden variables in bilinear models are independent, which causes insufficient training
of w because its relationship with A is neglected, as shown in the loss surface (right).
Our RBONN helps w escape from local minima and achieve a better solution.
or
$$\arg\min_{w,A} G(w, A) = \|b^w - Aw\|_2^2 + R(w), \qquad (3.121)$$
where $A = \mathrm{diag}(\frac{1}{\alpha_1},\cdots,\frac{1}{\alpha_N})$, $N$ is the number of elements in $\alpha$, $\circ$ denotes channel-wise
multiplication, and $R(\cdot)$ represents a regularization term, typically the $\ell_1$ or $\ell_2$ norm. $G(w, A)$
includes a bilinear form $Aw$ widely used in the field of computer vision [52, 162, 97].
Note that the bilinear function is $Aw$ rather than $G(w, A)$ in Eq. 3.121. Eq. 3.121 is
rational for BNNs with $A$ and $w$ as bilinearly coupled variables, since $w$ is the variable and
$b^w$ is just the sign of $w$.
We introduce a recurrent bilinear optimization for binary neural networks (RBONNs) [259]
by learning the coupled scaling factor and real-valued weight end-to-end. More specifically,
recurrent optimization can efficiently backtrack weights, which will be trained more suffi-
ciently than conventional methods. To this end, a Density-ReLU (DReLU) is introduced
to activate the optimization process based on the density of the variable A. In this way,
we achieve a controlled learning process with a backtracking mechanism by considering the
interaction of variables, thus avoiding the local minima and reaching the performance limit
of BNNs, as shown in Fig. 3.26.
However, such bilinear constraints lead to an asynchronous convergence problem
and directly affect the learning of $A$ and $w$: the variable with the slower convergence speed
(usually $w$) is not trained as sufficiently as the faster one. Moreover, BNNs are based on
nonconvex optimization and suffer more from the local minima problem because of this
asynchronous convergence. For example, $w$ tends to fall into a local optimum with low
magnitude when the magnitude of $A$ is much larger than 0 (because $b^w\in\{-1,+1\}$). On the
contrary, $w$ will have a large magnitude and thus converge slowly when the elements of $A$
are close to 0.
where λ is a hyperparameter and $G$ contains the bilinear part mentioned in Eq. 3.121.
$w$ and $A$ form a pair of coupled variables. Thus, the conventional gradient descent
method can be used to solve the bilinear optimization problem as
$$A^{t+1} = \Big|A^t - \eta_1\frac{\partial L}{\partial A^t}\Big|, \qquad (3.123)$$
$$\Big(\frac{\partial L}{\partial A^t}\Big)^T = \Big(\frac{\partial L_S}{\partial A^t}\Big)^T + \lambda\Big(\frac{\partial G}{\partial A^t}\Big)^T = \Big(\frac{\partial L_S}{\partial a_{out}^t}\frac{\partial a_{out}^t}{\partial A^t}\Big)^T + \lambda w^t\big(A^tw^t - b^{w^t}\big)^T = \Big(\frac{\partial L_S}{\partial a_{out}^t}\Big)^T\big(b^{a_{in}^t}\odot b^{w^t}\big)(A^t)^{-2} + \lambda w^t\hat{G}(w^t, A^t), \qquad (3.124)$$
where $\eta_1$ is the learning rate and $\hat{G}(w^t, A^t) = (A^tw^t - b^{w^t})^T$. The conventional gradient
descent algorithm for bilinear models iteratively optimizes one variable while keeping the
other fixed, which is suboptimal because it ignores the relationship between the two
hidden variables during optimization. For example, when $w$ approaches zero due to the sparsity
regularization term $R(w)$, $A$ will have a larger magnitude due to $G$ (Eq. 3.121). Consequently,
both the first and second terms of Eq. 3.124 will be dramatically suppressed, causing a
vanishing gradient problem for $A$. Conversely, if $A$ changes little during optimization, $w$ will
also suffer from the vanishing gradient problem due to the supervision of $G$, causing a local
minimum. Because of the coupling between $w$ and $A$, the gradient calculation for $w$ is
challenging.
$$w_{i,j}^{t+1} = w_{i,j}^t - \eta_2\frac{\partial L_S}{\partial w_{i,j}^t} - \eta_2\lambda\Big(\frac{\partial G}{\partial w_{i,j}^t} + \mathrm{Tr}\Big(\big(\frac{\partial G}{\partial A^t}\big)^T\frac{\partial A^t}{\partial w_{i,j}^t}\Big)\Big), \qquad (3.125)$$
where the trace term equals $\mathrm{Tr}\big(w^t\hat{G}(w^t, A^t)\frac{\partial A^t}{\partial w_{i,j}^t}\big)$.
we can derive
$$w\hat{G}(w, A) = \begin{bmatrix} w_1\hat{g}_1 & \cdots & w_1\hat{g}_i & \cdots & w_1\hat{g}_I\\ \vdots & & \vdots & & \vdots\\ w_I\hat{g}_1 & \cdots & w_I\hat{g}_i & \cdots & w_I\hat{g}_I \end{bmatrix}. \qquad (3.127)$$
Combining Eq. 3.126 and Eq. 3.127, we get
$$w\hat{G}(w, A)\Big(\frac{\partial A}{\partial w}\Big)_i = \begin{bmatrix} w_1\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,1}} & \cdots & w_1\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,J}}\\ \vdots & & \vdots\\ w_i\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,1}} & \cdots & w_i\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,J}}\\ \vdots & & \vdots\\ w_I\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,1}} & \cdots & w_I\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,J}} \end{bmatrix}. \qquad (3.128)$$
After that, the $i$-th component of the trace term in Eq. 3.125 is calculated by:
$$\mathrm{Tr}\Big[w\hat{G}\Big(\frac{\partial A}{\partial w}\Big)_i\Big] = w_i\hat{g}_i\sum_{j=1}^{J}\frac{\partial A_{i,i}}{\partial w_{i,j}}, \qquad (3.129)$$
so that the update in Eq. 3.125 can be written compactly as $w_{i,j}^{t+1} \leftarrow w_{i,j}^{t+1} + \eta_2\lambda\, d^t\odot w^t$,
where $\eta_2$ is the learning rate of the real-valued weight filters $w_i$ and $\odot$ denotes the Hadamard
product. We take $d^t = -\big[\hat{g}_1^t\sum_{j=1}^{J}\frac{\partial A_{1,1}^t}{\partial w_{1,j}^t},\cdots,\hat{g}_I^t\sum_{j=1}^{J}\frac{\partial A_{I,I}^t}{\partial w_{I,j}^t}\big]^T$, which is unsolvable and un-
defined in the backpropagation of BNNs. To address this issue, we employ a recurrent model
to approximate $d^t$ and have
and
$$w^{t+1} \leftarrow \hat{w}^{t+1}, \qquad (3.132)$$
where we introduce a hidden layer with channel-wise learnable weights $U\in\mathbb{R}_+^{C_{out}}$ to recurrently
backtrack $w$. We present DReLU to supervise this optimization process and realize a
controllable recurrent optimization. Channel-wise, we implement DReLU as
$$\mathrm{DReLU}(w_i, A_i) = \begin{cases} w_i & \text{if } (\neg D(w_i))\wedge D(A_i) = 1,\\ 0 & \text{otherwise}, \end{cases} \qquad (3.133)$$
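A minimal sketch of DReLU is given below. The exact definition of the density test D(·) is given with the full RBONN derivation, so here we assume a simple stand-in: D(·) returns true when the fraction of entries with magnitude above a threshold exceeds τ. The thresholds and names are illustrative, not the authors' code.

```python
import torch

def density(x, threshold=1e-2, tau=0.5):
    """Illustrative stand-in for the density test D(.): True when the fraction
    of entries with magnitude above `threshold` exceeds tau."""
    return bool((x.abs() > threshold).float().mean() > tau)

def drelu(w_i, A_i):
    """Channel-wise DReLU of Eq. 3.133: pass w_i through only when w_i is
    not dense while its coupled scale A_i is dense; otherwise output zero."""
    keep = (not density(w_i)) and density(A_i)
    return w_i if keep else torch.zeros_like(w_i)
```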
9: Update $A^{t+1}$, $w^{t+1}$, and $U^{t+1}$ according to Eqs. 6.69, 6.44, and 6.50, respectively.
10: end while
3.8.3 Discussion
In this section, we first review the related methods on “gradient approximation” of BNNs,
then further discuss the difference of RBONN with the related methods and analyze the
effectiveness of the proposed RBONN.
In particular, BNN [99] directly utilizes the straight-through estimator (STE) in the training
stage to calculate the gradients of weights and activations as
$$\frac{\partial b^{w_{i,j}}}{\partial w_{i,j}} = \mathbf{1}_{|w_{i,j}|<1}, \qquad \frac{\partial b^{a_{i,j}}}{\partial a_{i,j}} = \mathbf{1}_{|a_{i,j}|<1}, \qquad (3.137)$$
which suffers from an obvious mismatch between the gradient of the binarization function and its
estimator. Intuitively, Bi-Real Net [159] designs an approximate binarization function
that helps alleviate the gradient mismatch in backward propagation as
$$\frac{\partial b^{a_{i,j}}}{\partial a_{i,j}} = \begin{cases} 2 + 2a_{i,j}, & -1\le a_{i,j} < 0,\\ 2 - 2a_{i,j}, & 0\le a_{i,j} < 1,\\ 0, & \text{otherwise}, \end{cases} \qquad (3.138)$$
FIGURE 3.27
Effect of hyperparameters λ and τ on one- and two-stage training using 1-bit ResNet-18.
which is termed the ApproxSign function and is used for the backpropagation gradient
calculation of the activation. Compared to the traditional STE, ApproxSign has a shape
similar to that of the original binarization function sign, and thus the activation gradi-
ent error can be controlled to some extent. Similarly, CBCN [149] applies an approximate
function to address the gradient mismatch from the sign function. MetaQuant [38] intro-
duces Metalearning to learn the gradient error of weights using a neural network. IR-Net
[196] includes a self-adaptive Error Decay Estimator (EDE) to reduce the gradient error in
training, which considers different requirements on different stages of the training process
and balances the update ability of parameters and reduction of gradient error. RBNN [140]
proposes a training-aware approximation of the sign function for gradient backpropagation.
In summary, prior art focuses on approximating the gradient derived from $\frac{\partial b^{a_{i,j}}}{\partial a_{i,j}}$ or $\frac{\partial b^{w_{i,j}}}{\partial w_{i,j}}$.
Unlike other approaches, our approach focuses on a different aspect of the gradient
approximation, i.e., the gradient from $\frac{\partial G}{\partial w_{i,j}}$. Our goal is to decouple $A$ and $w$ to improve the
gradient calculation of $w$. RBONN manipulates $w$'s gradient from its bilinearly coupled
variable $A$ (i.e., $\frac{\partial G(A)}{\partial w_{i,j}}$). More specifically, our RBONN can be combined with the prior art by
comprehensively considering $\frac{\partial L_S}{\partial a_{i,j}}$, $\frac{\partial L_S}{\partial w_{i,j}}$, and $\frac{\partial G}{\partial w_{i,j}}$ in the backpropagation process.
FIGURE 3.28
(a) The epoch-wise weight oscillation of ReActNet. (b) Two randomly selected channels
of the first 1-bit layer in ReActNet [158]. The distribution has three peaks centered
around {−1, 0, +1}, which magnifies the non-parametric scaling factor (red line).
(c) An illustration of the weight oscillation caused by such an inappropriate scale calculation, where
w and L indicate the latent weight and the network loss function (blue line), respectively.
As a result, we apply this set of hyperparameters to the remaining experiments in this chapter.
Note that the recurrent model has no effect when τ is set to 1.
an oscillation from −1 to 1, and from iteration t to t+2 causes an oscillation from 1 to −1.
We further show that the oscillation is in fact controlled by the balanced parameter
attached to the reconstruction loss, providing a theoretical foundation for parameterizing
it in backpropagation. The oscillation only occurs when the gradient has a magnitude large
enough to change the sign of the latent weight. Consequently, we calculate the balanced
parameter based on the maximum magnitude of the weight gradient during each iteration,
leading to resilient gradients and effectively mitigating the weight oscillation.
where $L(\cdot)$ represents the training loss. Consequently, a closed-form solution of $\alpha^n$ can be
derived by the channel-wise absolute mean (CAM) as $\alpha_i^n = \frac{\|w_{i,:,:,:}^n\|_1}{M^n}$, where $M^n = C_{in}^n\times K^n\times K^n$.
For ease of representation, we use $w_i^n$ as an alternative to $w_{i,:,:,:}^n$ in the following. The
latent weight $w^n$ is updated using a standard gradient backpropagation algorithm, and its
gradient is calculated as:
$$\delta_{w_i^n} = \frac{\partial L}{\partial w_i^n} = \frac{\partial L}{\partial \hat{w}_i^n}\frac{\partial \hat{w}_i^n}{\partial w_i^n} = \alpha_i^n\frac{\partial L}{\partial \hat{w}_i^n}\odot\mathbf{1}_{|w_i^n|\le 1}, \qquad (3.141)$$
where $\odot$ denotes the Hadamard product and $\hat{w}^n = \alpha^n\circ b^{w^n}$.
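The channel-wise absolute mean and the resulting binarized weight ŵ = α ∘ b^w can be computed as in the short sketch below; the function names are our own.

```python
import torch

def cam_scaling_factor(w):
    """Channel-wise absolute mean (CAM): alpha_i = ||w_i||_1 / M with
    M = C_in * K * K, for weights w of shape (C_out, C_in, K, K)."""
    return w.abs().mean(dim=(1, 2, 3))          # one alpha per output channel

def binarize_weights(w):
    """Return w_hat = alpha o sign(w), broadcasting alpha over each filter."""
    alpha = cam_scaling_factor(w).view(-1, 1, 1, 1)
    return alpha * torch.sign(w)
```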
Discussion. Equation (3.141) shows that the weight gradient mainly comes from the nonparametric
$\alpha_i^n$ and the gradient $\frac{\partial L}{\partial\hat{w}_i^n}$. The latter is automatically computed in backpropagation and becomes
smaller as the network converges; however, $\alpha_i^n$ is often magnified by the trimodal distri-
bution [158]. Therefore, the weight oscillation originates mainly from $\alpha_i^n$. Given a single
weight $w_{i,j}^n$ ($1\le j\le M^n$) centering around zero, the gradient $\frac{\partial L}{\partial w_{i,j}^n}$ is misleading due to
the significant gap between $w_{i,j}^n$ and $\alpha_i^n b^{w_{i,j}^n}$. Consequently, bilevel optimization leads to
frequent weight oscillations. To address this issue, we reformulate traditional bilevel opti-
mization using a Lagrange multiplier and show that a learnable scaling factor is a natural
training stabilizer.
3.9.2 Method
We first give the learning objective as follows:
$$L_R(\mathbf{W}, \mathbf{A}) = \frac{1}{2}\sum_{n=1}^{N}\sum_{i=1}^{C_{out}^n}\gamma_i^n\big\|w_i^n - \alpha_i^n b^{w_i^n}\big\|_2^2, \qquad (3.143)$$
in which $\gamma_i^n$ is a balanced parameter. Based on this objective, the weight gradient in
Eq. (3.141) becomes:
$$\delta_{w_i^n} = \frac{\partial L}{\partial w_i^n} + \gamma_i^n\big(w_i^n - \alpha_i^n b^{w_i^n}\big) = \alpha_i^n\Big(\frac{\partial L}{\partial\hat{w}_i^n}\odot\mathbf{1}_{|w_i^n|\le 1} - \gamma_i^n b^{w_i^n}\Big) + \gamma_i^n w_i^n. \qquad (3.144)$$
The term $S_i^n(\alpha_i^n, w_i^n) = \gamma_i^n(w_i^n - \alpha_i^n b^{w_i^n})$ is an additional term added in the backpropagation
process. We add this term because a too-small $\alpha_i^n$ diminishes the gradient $\delta_{w_i^n}$ and leads to
a constant weight $w_i^n$. In what follows, we state and prove the proposition that $\delta_{w_{i,j}^n}$ is
a resilient gradient for a single weight $w_{i,j}^n$. We sometimes omit the subscripts $i,j$ and the
superscript $n$ for ease of representation.
Proposition 1. The additional term S(α, w) = γ(w − αbw ) achieves a resilient training
process by suppressing frequent weight oscillation. Its balanced factor γ can be considered
the parameter that controls the appearance of the weight oscillation.
Proof: We prove the proposition by contradiction. For a single weight $w$ centering around
zero, the straight-through estimator gives $\mathbf{1}_{|w|\le 1} = 1$, so we omit it in the following. Based
on Eq. (3.144), with a learning rate $\eta$, the weight update is formulated as:
$$w^{t+1} = w^t - \eta\delta_{w^t} = w^t - \eta\Big[\alpha^t\Big(\frac{\partial L}{\partial\hat{w}^t} - \gamma b^{w^t}\Big) + \gamma w^t\Big] = (1-\eta\gamma)w^t - \eta\alpha^t\Big(\frac{\partial L}{\partial\hat{w}^t} - \gamma b^{w^t}\Big) = (1-\eta\gamma)\Big[w^t - \frac{\eta\alpha^t}{1-\eta\gamma}\Big(\frac{\partial L}{\partial\hat{w}^t} - \gamma b^{w^t}\Big)\Big], \qquad (3.145)$$
where $t$ denotes the $t$-th training iteration and $\eta$ represents the learning rate. Different
weights have different distances from the quantization levels $\pm 1$; therefore, their gradients
should be modified according to their scaling factors and the current learning rate. We first
assume the initial state $b^{w^t} = -1$; the analysis also applies to the case of initial state
$b^{w^t} = 1$. The oscillation probability from iteration $t$ to $t+1$ is:
$$P\big(b^{w^t}\ne b^{w^{t+1}}\big)\Big|_{b^{w^t}=-1} \le P\Big(\frac{\partial L}{\partial\hat{w}^t}\le -\gamma\Big). \qquad (3.146)$$
Similarly, the oscillation probability from iteration $t+1$ to $t+2$ is:
$$P\big(b^{w^{t+1}}\ne b^{w^{t+2}}\big)\Big|_{b^{w^{t+1}}=1} \le P\Big(\frac{\partial L}{\partial\hat{w}^{t+1}}\ge \gamma\Big). \qquad (3.147)$$
Thus, the sequential oscillation probability from iteration $t$ to $t+2$ is:
$$P\Big(\big(b^{w^t}\ne b^{w^{t+1}}\big)\cap\big(b^{w^{t+1}}\ne b^{w^{t+2}}\big)\Big)\Big|_{b^{w^t}=-1} \le P\Big(\Big(\frac{\partial L}{\partial\hat{w}^t}\le -\gamma\Big)\cap\Big(\frac{\partial L}{\partial\hat{w}^{t+1}}\ge \gamma\Big)\Big), \qquad (3.148)$$
which means that the weight oscillation occurs only if the magnitudes of $\frac{\partial L}{\partial\hat{w}^t}$ and $\frac{\partial L}{\partial\hat{w}^{t+1}}$
are larger than $\gamma$. As a result, the attached factor $\gamma$ can be considered a
parameter that controls the occurrence of the weight oscillation.
However, if the conditions in Eq. (3.148) are met, then from Eq. (3.145) the gradient at
$\hat{w}^{t+1}$ satisfies:
$$\frac{\partial L}{\partial\hat{w}^{t+1}} = \frac{\partial L}{\partial\hat{w}^t} - \eta\frac{\partial^2 L}{\partial(\hat{w}^t)^2} \ge \gamma, \qquad \eta\frac{\partial^2 L}{\partial(\hat{w}^t)^2} \le \frac{\partial L}{\partial\hat{w}^t} - \gamma \le -2\gamma. \qquad (3.149)$$
Note that $\eta$ and $\gamma$ are both positive, so the second-order gradient $\frac{\partial^2 L}{\partial(\hat{w}^t)^2} < 0$ always
holds. Consequently, $L(\hat{w}^{t+1})$ can only be a local maximum rather than a minimum,
which contradicts the descent behavior of the training process and completes the proof.
Our proposition and proof reveal that the balanced parameter γ is a “threshold.” A
minimal “threshold” fails to mitigate the frequent oscillation effectively, while a too-large
threshold suppresses the necessary sign inversion and hinders the gradient descent process.
To solve this, we devise the learning rule of $\gamma$ as:
$$\gamma_i^{n,t+1} = \frac{1}{M^n}\Big\|b^{w_i^{n,t}}\odot b^{w_i^{n,t+1}} - 1\Big\|_0 \cdot \max_{1\le j\le M^n}\Big(\Big|\frac{\partial L}{\partial\hat{w}_{i,j}^{n,t}}\Big|\Big), \qquad (3.150)$$
where the first factor $\frac{1}{M^n}\|b^{w_i^{n,t}}\odot b^{w_i^{n,t+1}} - 1\|_0$ denotes the proportion of weights whose
sign changes, and the second factor $\max_{1\le j\le M^n}(|\frac{\partial L}{\partial\hat{w}_{i,j}^{n,t}}|)$, derived from Eq. (3.148),
denotes the gradient with the greatest magnitude at the $t$-th iteration. In this way, we suppress the
frequent weight oscillation with a resilient gradient.
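Putting Eqs. 3.144 and 3.150 together, the resilient update for one layer can be sketched as below. This is a schematic of the mechanism only, with illustrative names and without the full ReBNN training pipeline.

```python
import torch

def update_gamma(b_w_prev, b_w_curr, grad_w_hat):
    """Eq. 3.150: gamma = (fraction of sign flips) * max |dL/dw_hat|."""
    flip_ratio = (b_w_prev != b_w_curr).float().mean()
    return flip_ratio * grad_w_hat.abs().max()

def resilient_weight_grad(grad_w_hat, w, alpha, gamma):
    """Eq. 3.144: add S = gamma * (w - alpha * sign(w)) to the usual
    straight-through gradient alpha * dL/dw_hat * 1_{|w|<=1}."""
    ste = alpha * grad_w_hat * (w.abs() <= 1).float()
    return ste + gamma * (w - alpha * torch.sign(w))
```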
We further optimize the scaling factor as follows:
$$\delta_{\alpha_i^n} = \frac{\partial L}{\partial\alpha_i^n} + \frac{\partial L_R}{\partial\alpha_i^n}. \qquad (3.151)$$
The gradient derived from the softmax loss can easily be calculated via backpropagation. Based on Eq. (3.143), it is easy to derive
$$\frac{\partial L_R}{\partial\alpha_i^n} = \gamma_i^n\big(\alpha_i^n b^{w_i^n} - w_i^n\big)^T b^{w_i^n} = \gamma_i^n\big(\alpha_i^n M^n - \|w_i^n\|_1\big).$$
FIGURE 3.29
The evolution of latent weight distribution of (a) ReActNet and (b) ReBNN. We select
the first channel of the first binary convolution layer to show the evolution. The model is
initialized from the first stage training with W32A1 following [158]. We plot the distribution
every 32 epochs.
sign flip, thus hindering the training. Inspired by this, we use Eq. (3.150) to calculate γ
and improve performance by 0.6%, showing that considering the proportion of weight
oscillation allows the necessary sign flips and leads to more effective training. We also
show the training loss curves in Fig. 3.30(b). As plotted, the loss curves roughly reflect
how sufficiently the models are trained. We therefore conclude that ReBNN with γ calculated by
Eq. (3.150) achieves the lowest training loss and an efficient training process. Note that the
loss may not be minimal at each training iteration; still, our method remains a reasonable
variant of gradient descent that can be applied to the general optimization prob-
lem. We empirically prove ReBNN's capability of mitigating the weight
oscillation, leading to better convergence.
Resilient training process: This section shows the evolution of the latent weight distri-
bution. We plot the distribution of the first channel of the first binary convolution layer every 32
epochs in Fig. 3.29. As seen, our ReBNN can efficiently redistribute the BNNs toward re-
silience. Conventional ReActNet [158] has a tri-modal distribution, which is unstable
due to the scaling factor with large magnitudes. In contrast, our ReBNN is constrained by
the balanced parameter γ during training, leading to a resilient bi-modal distribution
with fewer weights centering around zero. We also plot the ratios of sequential weight os-
cillation of ReBNN and ReActNet for the 1st, 8th, and 16th binary convolution layers
TABLE 3.7
Comparison of different calculation methods for γ, including constants varying from 0 to 1e−2 and the gradient-based calculation.

| Value of γ                                                        | Top-1 | Top-5 |
| 0                                                                 | 65.8  | 86.3  |
| 1e−5                                                              | 66.2  | 86.7  |
| 1e−4                                                              | 66.4  | 86.7  |
| 1e−3                                                              | 66.3  | 86.8  |
| 1e−2                                                              | 65.9  | 86.5  |
| $\max_{1\le j\le M^n}(|\partial L/\partial\hat{w}_{i,j}^{n,t}|)$  | 66.3  | 86.2  |
| Eq. (3.150)                                                       | 66.9  | 87.1  |
FIGURE 3.30
(a) Epoch-wise weight oscillation ratio of ReActNet (solid), ReCU (dotted), and ReBNN
(dashed). (b) Comparing the loss curves of ReActNet and our ReBNN with different calcu-
lations of γ.
of ResNet-18. As shown in Fig. 3.30(a), the dashed lines have much lower magnitudes than
the solid (ReActNet) and dotted (ReCU [267]) lines of the same color, validating the
effectiveness of our ReBNN in suppressing consecutive weight oscillation. Besides, the
sequential weight oscillation ratios of ReBNN gradually decrease to 0 as the training
converges.
4
Binary Neural Architecture Search
4.1 Background
Deep convolutional neural networks (DCNNs) have dominated as the best performers on
various computer vision tasks such as image classification [84], instance segmentation [163],
and object detection [220] due to the great success of deep network architecture design.
With the increasing demand for architecture engineering, instead of designing complex
architectures manually, neural architecture search (NAS) is among the best approaches for
many tasks by generating delicate neural architectures.
Thanks to the rapid development of deep learning, significant gains in performance have
been realized in a wide range of computer vision tasks, most of which rely on manually designed
network architectures [123, 211, 84, 92]. The neural architecture search (NAS) approach
has recently attracted increased attention. The goal is to find automatic ways to design
neural architectures to replace conventional hand-crafted ones. Existing NAS approaches
need to explore a huge search space and can be roughly divided into three approaches:
evolution-based, reinforcement-learning-based, and one-shot-based.
To implement the architecture search within a short period, researchers try to reduce
the cost of evaluating each searched candidate. Early efforts include sharing weights be-
tween searched and newly generated networks [27]. Later, this method was generalized to
a more elegant framework called one-shot architecture search [20, 28, 151, 188, 254]. In
these approaches, an over-parameterized network or super-network covering all candidate
operations is trained only once, and the final architecture is obtained by sampling from this
super-network. For example, [20] trained the overparameterized network using a Hyper-
Net [81], and [188] proposed to share parameters among Child models to avoid retraining
each candidate from scratch. DARTS [151] introduces a differentiable framework and thus
combines the search and evaluation stages into one. Despite its simplicity, researchers have
found some drawbacks and proposed improved approaches over DARTS [254, 39]. PDARTS
[39] presents an efficient algorithm that allows the depth of searched architectures to grow
gradually during the training procedure, significantly reducing search time. ProxylessNAS
[29] adopted the differentiable framework and proposed to search architectures on the target
task instead of adopting the conventional proxy-based framework.
Binary neural architecture search replaces the real-valued weights and activations with
binarized ones, which consumes much less memory and computational resources to search
binary networks and provides a more promising way to efficiently find network architec-
tures. These methods can be categorized into direct binary architecture search and auxiliary
binary architecture search. Direct binary architecture search yields binary architectures di-
rectly from well-designed binary search spaces. As the first work in this field, BNAS1 [36]
effectively reduces search time by channel sampling and search space pruning in the early
training stages for a differentiable NAS. BNAS2 [114] utilizes diversity in the early search
to learn better-performing binary architectures. BMES [189] learns an efficient binary Mo-
bileNet [90] architecture through evolution-based search. However, the accuracy of the direct
binary architecture search can be improved by the auxiliary binary architecture search [24].
BATS [24] designs a new search space specially tailored for the binary network and incor-
porates it into the DARTS framework.
Unlike the aforementioned methods, our work is driven by the performance discrepancy
between the 1-bit neural architecture and its real-valued counterpart. We introduce tangent
propagation to explore the accuracy discrepancy and further accelerate the search process
by applying the GGN to the Hessian matrix in optimization. Furthermore, we introduce a
novel decoupled optimization to address asynchronous convergence in such a differentiable
NAS process, leading to better-performing 1-bit CNNs. The overall framework leads to a
novel and effective BNAS process.
To introduce the advances of the NAS area, we separately introduce the representative
works in the NAS and binary NAS in the following.
FIGURE 4.1
ABanditNAS is divided into two steps: sampling using LCB and abandoning using UCB.
they are confirmed to be bad. Meanwhile, when well trained, weight-free operations will
be compared only with parameterized operations. On the other hand, with the operation
pruning process, the search space becomes smaller and smaller, leading to an efficient search
process.
$$\tilde{\delta}_k = \sqrt{\frac{2\ln N}{n}}, \qquad (4.3)$$
where $N$ is the total number of trials.
Proof. Suppose $X\in[0,1]$ represents the theoretical value of each independently distributed
operation, $n$ is the number of times the arm has been played up to the current trial, and $p_i$ is the actual
value of the operation in the $i$-th trial. Furthermore, we define $p = \frac{\sum_i p_i}{n}$ and $q = 1 - p$.
Since the variance bound of independent operations can represent the global variance
bound (see the Appendix), based on Markov's inequality, we arrive at:
$$P[X > p+\delta] = P\Big[\sum_i(X_i - p_i) > \delta\Big] = P\big[e^{\lambda\sum_i(X_i-p_i)} > e^{\lambda\delta}\big] \le \frac{E\big[e^{\lambda\sum_i(X_i-p_i)}\big]}{e^{\lambda\delta}}. \qquad (4.4)$$
Since $1 + x\le e^x\le 1 + x + x^2$ when $0\le|x|\le 1$, $E[e^{\lambda\sum_i(X_i-p_i)}]$ in Eq. 4.4 can be further bounded as follows:
$$E\big[e^{\lambda\sum_i(X_i-p_i)}\big] = \prod_i E\big[e^{\lambda(X_i-p_i)}\big] \le \prod_i E\big[1 + \lambda(X_i-p_i) + \lambda^2(X_i-p_i)^2\big] = \prod_i\big(1 + \lambda^2 v_i^2\big) \le e^{\lambda^2 v^2}, \qquad (4.5)$$
where $v$ denotes the variance of $X$. Combining Eq. 4.4 and Eq. 4.5 gives $P[X > p+\delta]\le \frac{e^{\lambda^2v^2}}{e^{\lambda\delta}}$.
Since $\lambda$ is a positive constant, optimizing over $\lambda$ yields $P[X > p+\delta]\le e^{-2n\delta^2}$. According to the
symmetry of the distribution, we have $P[X < p-\delta]\le e^{-2n\delta^2}$. Finally, we get the following inequality:
$$P\big[|X - p|\le\delta\big]\ge 1 - 2e^{-2n\delta^2}. \qquad (4.6)$$
According to Eq. 4.6, the variance in the anti-bandit algorithm is bounded, and the
lower/upper confidence bounds can be estimated as
$$\tilde{r}_k - \sqrt{\frac{2\ln N}{n}} \le r_k \le \tilde{r}_k + \sqrt{\frac{2\ln N}{n}}. \qquad (4.8)$$
Following [307, 151, 291], we search for computation cells as the building blocks of the
final architecture. A cell is a fully connected directed acyclic graph (DAG) of M nodes, i.e.,
{B1 , B2 , ..., BM } as shown in Fig. 4.13. Here, each node is a specific tensor (e.g., a feature
map in convolutional neural networks), and each directed edge (i, j) between Bi and Bj
(i,j) (i,j)
denotes an operation o(i,j) (.), which is sampled from Ω(i,j) = {o1 , ..., oK }. {Ω(i,j) } is the
search space of a cell. Each node Bj takes its dependent nodes as input and can be obtained
by Bj = Σi<j o(i,j) (Bi ). The constraint i < j here is to avoid cycles in a cell. Each cell takes
the output of the last cell as input. For brevity, we denote by B0 the last node of the
previous cell and the first node of the current cell. Unlike existing approaches that use only
normal and reduction cells, we search for v (v > 2) cells instead. For general NAS search, we
follow [151] and take seven normal operations, i.e., 3 × 3 max pooling, 3 × 3 average pooling,
skip connection (identity), 3 × 3 convolution with rate 2, 5 × 5 convolution with rate 2, 3 × 3
depth-wise separable convolution, and 5 × 5 depth-wise separable convolution. Considering
adversarially robust optimization for NAS, we introduce two additional operations, the 3×3
Gabor filter and denoising block, for model defense. Therefore, the size of the entire search
space is $K^{|E_M|\times v}$, where $E_M$ is the set of possible edges with $M$ intermediate nodes in the
fully connected DAG. In the case with $M = 4$ and $v = 6$, together with the input node, the
total number of cell structures in the search space is $9^{(1+2+3+4)\times 6} = 9^{10\times 6}$. Here, we briefly
introduce the two additional operations.
Gabor filter. Gabor filters [69, 68] containing frequency and orientation representations
can characterize the spatial frequency structure in images while preserving spatial relation-
ships. This operation provides superb robustness for the network [191]. Gabor filters are defined as
$\exp\big(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\big)\cos\big(2\pi\frac{x'}{\lambda} + \psi\big)$, where $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$.
σ, γ, λ, ψ, and θ are learnable parameters. Note that the symbols used here apply only
to the Gabor filter and are different from the symbols used in the rest of this chapter.
Figure 4.2(b) shows an example of Gabor filters.
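The Gabor operation can be instantiated as a small kernel generator following the standard definition above. In ABanditNAS the parameters σ, γ, λ, ψ, and θ are learnable, whereas in this illustrative sketch they are plain floats, and the 3×3 size is only an example.

```python
import math
import torch

def gabor_kernel(size=3, sigma=1.0, gamma=0.5, lam=2.0, psi=0.0, theta=0.0):
    """Build a size x size Gabor kernel:
    exp(-(x'^2 + gamma^2 y'^2) / (2 sigma^2)) * cos(2 pi x'/lambda + psi),
    with x' = x cos(theta) + y sin(theta), y' = -x sin(theta) + y cos(theta)."""
    half = (size - 1) / 2.0
    ys, xs = torch.meshgrid(torch.linspace(-half, half, size),
                            torch.linspace(-half, half, size),
                            indexing="ij")
    x_r = xs * math.cos(theta) + ys * math.sin(theta)
    y_r = -xs * math.sin(theta) + ys * math.cos(theta)
    envelope = torch.exp(-(x_r ** 2 + (gamma ** 2) * y_r ** 2) / (2 * sigma ** 2))
    carrier = torch.cos(2 * math.pi * x_r / lam + psi)
    return envelope * carrier
```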
Denoising block. As described in [253], adversarial perturbations on images will intro-
duce noise in the features. Therefore, denoising blocks can improve adversarial robustness by
denoising features. Following this, we add the nonlocal mean denoising block [22] as shown
in Fig. 4.2(c) to the search space to denoise the features. A denoised feature map $z$ is calculated
from an input feature map $x$ by taking a weighted mean over the spatial locations $L$ as
$z_p = \frac{1}{C(x)}\sum_{\forall q\in L} f(x_p, x_q)\cdot x_q$, where $f(x_p, x_q)$ is a feature-dependent
weighting function and $C(x)$ is a normalization function.
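A dot-product non-local mean in the generic form above, wrapped by a 1×1 convolution and an identity skip as in [253], can be sketched as follows; the softmax is used here as one choice of the normalization C(x), and all names are illustrative.

```python
import torch
import torch.nn as nn

class DenoisingBlock(nn.Module):
    """Sketch of a non-local denoising block: z_p = (1/C(x)) sum_q f(x_p, x_q) x_q,
    with a dot-product weighting f, wrapped by a 1x1 convolution and a residual
    connection as in [253]."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        flat = x.reshape(n, c, h * w)                       # (N, C, HW)
        weights = torch.einsum("ncp,ncq->npq", flat, flat)  # f(x_p, x_q) = x_p . x_q
        weights = weights.softmax(dim=-1)                   # normalization C(x)
        z = torch.einsum("npq,ncq->ncp", weights, flat).reshape(n, c, h, w)
        return x + self.proj(z)                             # 1x1 conv + identity skip
```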
FIGURE 4.2
(a) A cell containing four intermediate nodes B1 , B2 , B3 , B4 that apply sampled operations
on the input node B0 . B0 is from the output of the last cell. The output node concatenates
the outputs of the four intermediate nodes. (b) Gabor Filter. (c) A generic denoising block.
Following [253], it wraps the denoising operation with a 1 × 1 convolution and an identity
skip connection [84].
we progressively abandon the worst-performing operation and sample the operations with
small expectation but significant variance for each edge. Unlike [291], which uses
the performance as the evaluation metric to decide which operation should be pruned, we use
the anti-bandit algorithm described in Section 4.2.1 to make this decision.
Following UCB in the bandit algorithm, we obtain the initial performance for each
operation on every edge. Specifically, we sample one of the K operations in $\Omega^{(i,j)}$ for every
edge, then obtain the validation accuracy $a$, which serves as the initial performance $m_{k,0}^{(i,j)}$,
by adversarially training the sampled network for one epoch, and finally assign this accuracy
to all the sampled operations.
By considering the confidence of the $k$-th operation using Eq. 4.8, the LCB is calculated
by
$$s_L\big(o_k^{(i,j)}\big) = m_{k,t}^{(i,j)} - \sqrt{\frac{2\log N}{n_{k,t}^{(i,j)}}}, \qquad (4.9)$$
where $N$ is the total number of samples, $n_{k,t}^{(i,j)}$ refers to the number of times the $k$-th operation
of the edge $(i,j)$ has been selected, and $t$ is the epoch index. The first term in Eq. 4.9 is the
value term (see Eq. 4.2), which favors the operations that look good historically, and the
second is the exploration term (see Eq. 4.3), which gives operations an exploration
bonus that grows with $\log N$. The selection probability for each operation is defined as
$$p\big(o_k^{(i,j)}\big) = \frac{\exp\{-s_L(o_k^{(i,j)})\}}{\sum_m \exp\{-s_L(o_m^{(i,j)})\}}. \qquad (4.10)$$
The minus sign in Eq. 4.10 means that we prefer to sample operations with a smaller
confidence bound. After sampling one operation for every edge based on $p(o_k^{(i,j)})$, we obtain the
validation accuracy $a$ by adversarially training the sampled network for one epoch, and then
update the performance $m_{k,t}^{(i,j)}$, which historically indicates the validation accuracy of all the
sampled operations $o_k^{(i,j)}$, as
$$m_{k,t}^{(i,j)} = (1-\lambda)\,m_{k,t-1}^{(i,j)} + \lambda\, a, \qquad (4.11)$$
where $\lambda$ is a hyperparameter.
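The per-edge sampling step of Eqs. 4.9–4.11 can be sketched as below, where `m`, `n`, and `N` hold the recorded performances, the per-operation selection counts, and the total number of samples; the guard against zero counts and the default λ = 0.7 (the value found most robust later in this chapter) are our own illustrative choices.

```python
import math
import random

def sample_operation(m, n, N):
    """Sample one operation index on an edge using the LCB of Eq. 4.9
    and the softmax-style probability of Eq. 4.10."""
    s_L = [m_k - math.sqrt(2 * math.log(N) / max(n_k, 1)) for m_k, n_k in zip(m, n)]
    weights = [math.exp(-s) for s in s_L]
    total = sum(weights)
    return random.choices(range(len(m)), weights=[w_ / total for w_ in weights])[0]

def update_performance(m, k, accuracy, lam=0.7):
    """Eq. 4.11: exponential moving average of the validation accuracy."""
    m[k] = (1 - lam) * m[k] + lam * accuracy
```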
The operation with the minimal UCB, $s_U(o_k^{(i,j)}) = m_{k,t}^{(i,j)} + \sqrt{\frac{2\log N}{n_{k,t}^{(i,j)}}}$, on every edge is abandoned.
This means that operations that are given more opportunities but result in poor performance are removed. With this
pruning strategy, the search space is significantly reduced from $|\Omega^{(i,j)}|^{10\times 6}$ to $(|\Omega^{(i,j)}| -
1)^{10\times 6}$, and the reduced space becomes
$$\Omega^{(i,j)} \leftarrow \Omega^{(i,j)} - \Big\{\arg\min_{o_k^{(i,j)}} s_U\big(o_k^{(i,j)}\big)\Big\}, \quad \forall(i,j). \qquad (4.13)$$
The reduction procedure is repeated until the optimal structure is obtained, where only one
operation is left on each edge.
Complexity Analysis. There are $O(K^{|E_M|\times v})$ combinations in the search space dis-
covery process with $v$ types of different cells. In contrast, ABanditNAS reduces the search
space every $K*T$ epochs. Therefore, the complexity of the proposed method is
$$\sum_{k=2}^{K} O(T\times k) = O(TK^2). \qquad (4.14)$$
The goal of adversarial training [167] is to learn networks that are robust to adversarial
attacks. Given a network $f_\theta$ parameterized by $\theta$, a dataset $(x_e, y_e)$, a loss function $l$, and
a threat model $\Delta$, the learning problem can be formulated as the following optimization
problem: $\min_\theta\sum_e\max_{\delta\in\Delta} l\big(f_\theta(x_e+\delta), y_e\big)$, where $\delta$ is the adversarial perturbation. In this
chapter, we consider the typical $\ell_\infty$ threat model [167], $\Delta = \{\delta: \|\delta\|_\infty\le\epsilon\}$ for some $\epsilon > 0$,
where $\|\cdot\|_\infty$ is the $\ell_\infty$ norm distance metric and $\epsilon$ is the adversarial manipulation budget.
The adversarial training procedure uses attacks to approximate the inner maximization over
$\Delta$, followed by some variation of gradient descent on the model parameters $\theta$. For example,
one of the earliest versions of adversarial training uses the Fast Gradient Sign Method
(FGSM) [75] to approximate the inner maximization. This is a relatively
inaccurate approximation of the inner maximization for $\ell_\infty$ perturbations, and it has the closed-
form solution $\delta = \epsilon\cdot\mathrm{sign}\big(\nabla_x\, l(f(x), y)\big)$. A better approximation of the inner maximization
is to take multiple smaller FGSM steps of size α instead. However, the number of gradient
computations caused by the multiple steps is proportional to O(EF ) in a single epoch, where
E is the size of the data set and F is the number of steps taken by the adversary PGD.
This is F times higher than standard training with O(E) gradient computations per epoch,
and adversarial training is typically F times slower. To accelerate adversarial training, we
combine FGSM with random initialization [247] for our ABanditNAS. Our ABanditNAS
with adversarial training is summarized in Algorithm 8.
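The fast adversarial training used here, FGSM with random initialization following [247], can be sketched as a single training step. The epsilon and alpha values below are illustrative defaults, and the model, optimizer, and data tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_rs_training_step(model, x, y, optimizer, epsilon=8/255, alpha=10/255):
    """One adversarial training step with FGSM + random initialization [247]:
    delta is initialized uniformly in [-epsilon, epsilon], perturbed by one
    FGSM step of size alpha, and clipped back to the l_inf ball."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon).detach()

    optimizer.zero_grad()
    F.cross_entropy(model(x + delta), y).backward()
    optimizer.step()
```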
4.2.5 Analysis
Effect of the hyperparameter λ. The hyperparameter λ balances the past and current
performance. Different values of λ result in similar search costs. The
performance of the structures searched by ABanditNAS with different values of λ is used
to find the best λ. We train the structures in the same setting. From Fig. 4.3, we can see
that when λ = 0.7, ABanditNAS is most robust.
Effect on the search space. We test the performance of ABanditNAS with different
search spaces. In this part, we adopt the same experimental setting as the general NAS. The
search space of the general NAS has 7 operations. We incrementally add the Gabor filter,
denoising block, 1×1 dilated convolution with rate 2 and 7×7 dilated convolution with rate
2, until the number of operations in the search space reaches 11. In Table 4.1, # Search Space
represents the number of operations in the search space. Although the difficulty of searching
increases with increasing search space, ABanditNAS can effectively select the appropriate
operations. Each additional operation has little effect on search efficiency, demonstrating
the efficiency of our search method. When the number of operations in the search space is 9,
the classification accuracy of the model searched by ABanditNAS exceeds all the methods
with the same level of search cost.
FIGURE 4.3
Performances of structures searched by ABanditNAS with different hyper-parameter values
λ.
savings, which is widely considered as one of the most efficient ways to perform computing
on embedded devices with low computational cost. In [199], the XNOR network is presented,
where the weights and inputs attached to the convolution are approximated with binarized
values. This efficiently implements convolutional operations by reconstructing the unbina-
rized filters with a single scaling factor. In [77], a projection convolutional neural network
(PCNN) is proposed to implement binarized neural networks (BNNs) based on a simple
back-propagation algorithm. [287] proposes Bayesian optimized 1-bit CNNs, taking advan-
tage of Bayesian learning to significantly improve the performance of extreme 1-bit CNNs.
Binarized models show advantages in computational cost reduction and memory savings.
However, they suffer from poor performance in practical applications. There still remains a
gap between 1-bit weights/activations and their full-precision counterparts, which motivates us
to explore the potential relationship between 1-bit and full-precision models to evaluate bi-
narized networks' performance based on NAS. This section introduces a Child-Parent model
to efficiently search for a binarized network architecture in a unified framework.
The search strategy for the Child-Parent model consists of three steps shown in Fig. 4.4.
First, we sample the operations without replacement and construct two classes of subnet-
works that share the same architecture, i.e., binarized networks (child) and full-precision
networks (parent). Second, we train both subnetworks and obtain the performance indicator
of the corresponding operations by calculating the child network accuracy and the accuracy
TABLE 4.1
The performance of ABanditNAS with different search spaces on CIFAR10.

| Architecture | # Search Space | Accuracy (%) | # Params (M) | Search Cost (GPU days) | Search Method |
| ABanditNAS   | 7              | 97.13        | 3.0          | 0.09                   | Anti-Bandit   |
| ABanditNAS   | 8              | 97.47        | 3.3          | 0.11                   | Anti-Bandit   |
| ABanditNAS   | 9              | 97.52        | 4.1          | 0.13                   | Anti-Bandit   |
| ABanditNAS   | 10             | 97.53        | 2.7          | 0.15                   | Anti-Bandit   |
| ABanditNAS   | 11             | 97.66        | 3.7          | 0.16                   | Anti-Bandit   |
FIGURE 4.4
The main framework of the proposed Child-Parent search strategy. In each loop, we first sample
an operation without replacement for each edge of the search space and then train the child
and parent models generated from the same architecture simultaneously. Second, we use
Eqs. 4.15 and 4.28 to compute the evaluation indicator from the accuracy of both
models on the validation data set. Once all operations have been selected, we remove the operation
with the worst performance on each edge.
loss between the child and parent networks. It is observed that the worst operations in the early
stage usually also perform worse in the end. Based on this observation, we then
remove the operation with the worst performance according to the performance indicator.
This process is repeated until only one operation is left on each edge. We reformulate the
traditional loss function as a kernel-level Child-Parent loss for the binarized optimization of
the child-parent model.
FIGURE 4.5
The main framework of the Child-Parent model. The Child-Parent model focuses on bina-
rized architecture search (left) and binarized optimization (right).
where $A_{P,t}$ and $A_{C,t}$ represent the network performance calculated from the accuracy of the
full-precision model (Parent) and the binarized model (Child) on the validation dataset, and
$\beta_P$ is the hyperparameter controlling the performance loss. $(i, j)$ represents the edge between
nodes $i$ and $j$ shown in Fig. 4.6, $k$ is the operation index of the corresponding
edge, and $t$ represents the $t$-th sampling process. Note that we use the performance of the
sampled network to evaluate the performance of the corresponding selected operations.
CP-NAS [304] not only uses the accuracy on the validation dataset to guide the search
process directly but also considers the information of the full-precision model to investigate
better the full potential of the binarized model that can ultimately be reached. Additional
details are provided in the following section.
As shown in Fig. 4.5, unlike the traditional teacher-student model [87], which transfers
the generalization ability of the first model to a smaller model by using the class proba-
bilities as “soft targets,” the child-parent model focuses on the performance measure that
is particularly suitable for NAS-based network binarization. Furthermore, the loss function
for the teacher-student model is constrained to the feature map or the output, while ours
focuses on the kernel weights.
FIGURE 4.6
The cell architecture for CP-NAS. A cell includes 2 input nodes, 4 intermediate nodes, and
14 edges.
FIGURE 4.7
The operations of each edge. Each edge has 4 convolutional operations, comprising 2 types of
binarized convolution with 3×3 or 5×5 receptive fields, and 4 non-convolutional operations.
FIGURE 4.8
Compared with the original depthwise separable convolution (left), a new binarized depthwise
separable convolution is designed for CP-NAS (right).
where $\bar{z}_k^{(i,j)} = \frac{1}{T}\sum_t z_{k,t}^{(i,j)}$. As the epochs increase, we progressively abandon the operation
with the worst evaluation on each edge until only one operation is left per edge.
where λ is a hyperparameter that balances the two terms, $H_c^l$ is the $c$-th full-precision filter of
the $l$-th convolutional layer, $\hat{H}_c^l$ denotes its corresponding reconstructed filter, and MSE(·)
represents the mean squared error (MSE) loss. The second term enforces intraclass
compactness, since the binarization process causes feature variations. $f_{C,s}(\hat{H})$ denotes the
feature map of the last convolutional layer for the $s$-th sample, and $\bar{f}_{C,s}(\hat{H})$ denotes the
class-specific mean feature map for the corresponding samples. Combining $L_{\hat{H}}$ with the
conventional loss $L_{CE}$, we obtain the final loss. The loss $L$ and its derivatives are easily
calculated using an automatic differentiation package.
FIGURE 4.9
The result for different $\beta_P$ on CIFAR-10 (right). The 1-bit CNN results for different
search strategies on CIFAR-10 (left), including random search, PC (PC-DARTS), BNAS†, and CP-
NAS. We approximately implement BNAS† by setting $\beta_P$ to 0 in CP-NAS, which means
that we only use the performance measure for operation selection.
FIGURE 4.10
Motivation for DCP-NAS. We first show that directly binarizing a real-valued architecture to 1-
bit is sub-optimal. We therefore use tangent propagation (middle) to find an optimized 1-bit
neural architecture along the tangent direction, leading to a better-performing 1-bit neural
architecture.
4.4.1 Preliminary
Neural architecture search. Given a conventional CNN model, we denote $w\in\mathcal{W}$, with
$\mathcal{W} = \mathbb{R}^{C_{out}\times C_{in}\times K\times K}$, and $a_{in}\in\mathbb{R}^{C_{in}\times W\times H}$ as its weights and feature maps in a specific
layer. $C_{out}$ and $C_{in}$ represent the output and input channels of the specific layer, $(W, H)$ are
the width and height of the feature maps, and $K$ is the kernel size. Then we have
FIGURE 4.11
The main framework of the proposed DCP-NAS, where α and α̂ denote real-valued and
binary architecture, respectively. We first conduct the real-valued NAS in a single round
and generate the corresponding tangent direction. Then we learn a discrepant binary ar-
chitecture via tangent propagation. In this process, real-valued and binary networks inherit
architectures from their counterparts, in turn.
where ⊗ is the convolution operation. We omit the batch normalization (BN) and activation
layers for simplicity. Based on this, a normal NAS problem is given as
$$\max_{w \in \mathcal{W},\, \alpha \in \mathcal{A}} \tilde{f}(w, \alpha) = -\sum_{n=1}^{N} p_n(\mathcal{X}) \log\big(p_n(w, \alpha)\big), \qquad (4.21)$$
where N denotes the number of classes and X is the input data. f˜(w, α) represents the
performance of a specific architecture with real value weights, where pn (X ) and pn (w, α)
denote the true distribution and the distribution of network prediction, respectively.
Binary neural architecture search The 1-bit model aims to quantize ŵ and âin into
bŵ ∈ {−1, +1}Cout ×Cin ×K×K and bâin ∈ {−1, +1}Cin ×H×W using the efficient XNOR and
Bit-count operations to replace full precision operations. Following [48], the forward process
of the 1-bit CNN is
$$\hat{a}_{out} = \beta \circ b_{\hat{a}_{in}} \circledast b_{\hat{w}}, \qquad (4.22)$$
where $\circledast$ denotes the XNOR and bit-count operations and $\circ$ denotes channel-wise multiplication. $\beta = [\beta_1, \cdots, \beta_{C_{out}}] \in \mathbb{R}_{+}^{C_{out}}$ is the vector of channel-wise scale factors. $b = \mathrm{sign}(\cdot)$ denotes the binarized variable obtained with the sign function, which returns $+1$ if the input is greater than zero and $-1$ otherwise. It then enters several non-linear layers, e.g.,
BN layer, non-linear activation layer, and max-pooling layer. We omit these for the sake
of simplification. Then, the output âout is binarized to bâout by the sign function. The
fundamental objective of BNNs is to calculate ŵ. We want it to be as close as possible before
and after binarization to minimize the binarization effect. Then, we define the reconstruction
error following [77] as
$$L_R(\hat{w}, \beta) = \|\hat{w} - \beta \circ b_{\hat{w}}\|_2^2. \qquad (4.23)$$
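To make Eqs. 4.22–4.23 concrete, the following minimal PyTorch sketch binarizes a weight tensor with the sign function, computes a channel-wise scale factor (here the per-channel L1 mean, a common closed-form choice in XNOR-style BNNs, assumed purely for illustration), and measures the reconstruction error of Eq. 4.23. The tensor shapes are also assumptions; the sketch is not the exact training procedure of this chapter.

```python
import torch

def binarize_with_scale(w: torch.Tensor):
    """Binarize a weight tensor w of shape (C_out, C_in, K, K).

    Returns the sign tensor b_w in {-1, +1} and a channel-wise scale
    factor beta (one scalar per output channel), here taken as the
    per-channel L1 mean -- a common closed-form choice in XNOR-style BNNs.
    """
    b_w = torch.sign(w)
    b_w[b_w == 0] = 1.0                      # sign(0) -> +1 by convention
    beta = w.abs().mean(dim=(1, 2, 3))       # shape (C_out,)
    return b_w, beta

def reconstruction_error(w, b_w, beta):
    """L_R(w, beta) = || w - beta o b_w ||_2^2  (Eq. 4.23)."""
    w_rec = beta.view(-1, 1, 1, 1) * b_w     # channel-wise rescaling
    return torch.sum((w - w_rec) ** 2)

w = torch.randn(64, 32, 3, 3)
b_w, beta = binarize_with_scale(w)
print(reconstruction_error(w, b_w, beta))
```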
Based on the above derivation, the vanilla direct BNAS [36, 114] can be defined as
where bŵ = sign(ŵ) is used for inference and α̂ is a neural architecture with binary weights.
The prior direct BNAS [36] learns the architecture from an objective of the form
$$\max_{\hat{w} \in \mathcal{W},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^{+}} \tilde{f}_b(\hat{w}, \hat{\alpha}, \beta) = \sum_{n=1}^{N} \hat{p}_n(\hat{w}, \hat{\alpha}, \beta) \log\big(\hat{p}_n(\mathcal{X})\big), \qquad (4.25)$$
where we use notations similar to those of Eq. 4.21. Equation 4.25 means that the vanilla direct BNAS only focuses on the binary search space under the supervision of the cross-entropy loss, which is less effective because the search process is not exhaustive [24].
$$\hat{w}^*, \hat{\alpha}^*, \beta^* = \mathop{\arg\min}_{\hat{w}\in\hat{\mathcal{W}},\,\alpha\in\mathcal{A},\,\beta\in\mathbb{R}^{+}} L_{\text{CP-NAS}}\big(\tilde{f}^P(w,\alpha), \tilde{f}_b^C(\hat{w},\hat{\alpha},\beta)\big) = \mathop{\arg\min}_{\hat{w}\in\hat{\mathcal{W}},\,\alpha\in\mathcal{A},\,\beta\in\mathbb{R}^{+}} \tilde{f}^P(w,\alpha) - \tilde{f}_b^C(\hat{w},\hat{\alpha},\beta), \qquad (4.26)$$
where $\tilde{f}^P(w, \alpha)$ denotes the performance of the real-valued Parent model as defined in Eq. 4.21. $\tilde{f}_b^C$ is further defined as $\tilde{f}_b^C(\hat{w}, \alpha, \beta) = \sum_{n=1}^{N} \hat{p}_n(\hat{w}, \alpha, \beta)\log(\hat{p}_n(\mathcal{X}))$, following Eq. 4.25. As shown in Eq. 4.26, we propose $L_{\text{CP-NAS}}$ to estimate the performance of candidate
FIGURE 4.12
The main framework of the Discrepant Child-Parent model. In orange, we show the critical
novelty of DCP-NAS, i.e., tangent propagation and decoupled optimization.
architectures with binarized weights and activations, which considers both the real-valued and the binarized architectures.
where $i$ indexes the dependent nodes of $j$, with the constraint $i < j$ to avoid cycles in a cell, and $a^{(j)}$ is the output of node $j$. $w^{(i,j)}$ denotes the weights of the convolution operation between the $i$-th and $j$-th nodes, and $\otimes$ denotes the convolution operation. Each node is a specific tensor, such as a feature map, and each directed edge $(i, j)$ denotes an operation $o^{(i,j)}(\cdot)$ sampled from the following $M = 8$ operations:
FIGURE 4.13
The cell architecture for DCP-NAS. One cell includes 2 input nodes, 4 intermediate nodes,
and 14 edges.
We replace the depth-wise separable convolution with a binarized form, i.e., with binarized weights and activations. The skip connection is an identity mapping in NAS, rather than an additional shortcut. Optimizing BNNs is more challenging than optimizing conventional CNNs [77, 199], as binarization adds an additional burden to NAS. Following [151], to reduce the undesirable fluctuation in performance evaluation, we normalize the architecture parameters of the M operations for each edge to obtain the final architecture indicator as
$$\hat{o}_m^{(i,j)}(a^{(j)}) = \frac{\exp\{\alpha_m^{(i,j)}\}}{\sum_{m'} \exp\{\alpha_{m'}^{(i,j)}\}}\, o_m^{(i,j)}(a^{(j)}). \qquad (4.28)$$
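A minimal sketch of the normalization in Eq. 4.28: the M architecture parameters on one edge are turned into a softmax distribution, and each candidate operation's output is weighted by its normalized coefficient. The toy operation list (identity and average pooling) is only an assumption to keep the example self-contained; it is not the full 8-operation search space.

```python
import torch
import torch.nn.functional as F

def mixed_edge_output(alpha_edge: torch.Tensor, ops, x: torch.Tensor):
    """Eq. 4.28: weight the M candidate operations of one edge by the
    softmax of their architecture parameters and sum the results."""
    weights = F.softmax(alpha_edge, dim=0)          # exp(a_m) / sum_m' exp(a_m')
    return sum(w * op(x) for w, op in zip(weights, ops))

# Toy candidate operations (assumed for illustration only).
ops = [
    lambda x: x,                                    # identity / skip connection
    lambda x: F.avg_pool2d(x, 3, stride=1, padding=1),
]
alpha_edge = torch.zeros(len(ops), requires_grad=True)
x = torch.randn(1, 16, 8, 8)
out = mixed_edge_output(alpha_edge, ops, x)
print(out.shape)
```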
$$\max_{\hat{w} \in \mathcal{W},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^{+}} G(\hat{w}, \hat{\alpha}, \beta) = \tilde{f}^P(w, \alpha)\log\frac{\tilde{f}^P(w, \alpha)}{\tilde{f}_b^C(\hat{w}, \hat{\alpha}, \beta)} = \sum_{n=1}^{N} p_n(w, \alpha)\log\Big(\frac{\hat{p}_n(\hat{w}, \hat{\alpha}, \beta)}{p_n(w, \alpha)}\Big), \qquad (4.30)$$
where the KL divergence is used to supervise the binary search process. G(ŵ, α̂, β) calculates
the similarity of the output logits between the real value network p(·) and the binary network
p̂(·), where the teacher’s output is already given.
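The KL-style supervision in Eq. 4.30 can be sketched as follows: the real-valued Parent's softmax output acts as a fixed teacher distribution, and the binary Child is pushed toward it. This is a generic knowledge-distillation-style KL term written for illustration; the exact DCP-NAS training loop is not reproduced here.

```python
import torch
import torch.nn.functional as F

def kl_supervision(parent_logits: torch.Tensor, child_logits: torch.Tensor):
    """KL(p_parent || p_child), summed over classes and averaged over the batch.
    The Parent distribution is detached: it acts as a fixed teacher (cf. Eq. 4.30)."""
    p = F.softmax(parent_logits, dim=-1).detach()
    log_q = F.log_softmax(child_logits, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")

parent_logits = torch.randn(8, 10)                        # real-valued supernet outputs
child_logits = torch.randn(8, 10, requires_grad=True)     # 1-bit supernet outputs
loss = kl_supervision(parent_logits, child_logits)
loss.backward()                                           # gradients flow only to the Child
```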
To further optimize the binary architecture, we constrain the gradient of binary NAS
using the tangent direction as
We use Eqs. 4.30 – 4.31 to jointly learn the DCP-NAS and rewrite the objective function
in Eq. 4.26 as
$$L_{\text{DCP-NAS}}\big(\tilde{f}^P(w, \alpha), \tilde{f}_b^C(\hat{w}, \hat{\alpha}, \beta)\big) = -G(\hat{w}, \hat{\alpha}, \beta) + \lambda D(\hat{\alpha}) + \mu L_R(\hat{w}, \beta). \qquad (4.32)$$
Then we optimize the binary architecture α̂ along the tangent direction of the real-valued model, from which it inherits its architecture. Note that when we set λ = 0, Eq. 4.32 is equivalent to the objective of the original CP-NAS [304]. As revealed in [157], real-valued weights converge faster than binarized ones. Motivated by this observation, the tangent direction of the Parent supernet can be used to approximate the optimization direction of the more slowly converging Child supernet. To conclude, in Eq. 4.31 we improve the optimization of the Child architecture based on the tangent direction of the Parent architecture, which allows the Child supernet to be trained more efficiently.
Considering that binary weights are learned by KL divergence, we optimize our DCP-
NAS as
In the following, we prove that the Hessian matrix of the loss function is directly related to the expectation of the covariance of the gradient. Taking the loss function as the negative logarithm of the likelihood, let $\mathcal{X}$ be a set of input data to the network and $p(\mathcal{X}; \hat{w}, \hat{\alpha})$ be the predicted distribution on $\mathcal{X}$ when the parameters of the network are $\hat{w}$ and $\hat{\alpha}$, i.e., the output logits of the head layer. Omitting $\hat{w}$ for simplicity, the Fisher information of the set of probability distributions $P = \{p_n(\mathcal{X}; \hat{\alpha}), n \in N\}$ can be described by a matrix whose value in the $i$-th row and $j$-th column is
$$I_{i,j}(\hat{\alpha}) = E_{\mathcal{X}}\Big[\frac{\partial \log p_n(\mathcal{X}; \hat{\alpha})}{\partial \hat{\alpha}_i}\,\frac{\partial \log p_n(\mathcal{X}; \hat{\alpha})}{\partial \hat{\alpha}_j}\Big]. \qquad (4.36)$$
Recall that N denotes the number of classes described in Eq. 4.21. It is then trivial to prove
that the Fisher information of the probability distribution set P approaches a scaled version
of the Hessian of log-likelihood as
$$I_{i,j}(\hat{\alpha}) = -E_{\mathcal{X}}\Big[\frac{\partial^2 \log p_n(\mathcal{X}; \hat{\alpha})}{\partial \hat{\alpha}_i\, \partial \hat{\alpha}_j}\Big]. \qquad (4.37)$$
Let $H_{i,j}$ denote the second-order partial derivative $\frac{\partial^2}{\partial \hat{\alpha}_i \partial \hat{\alpha}_j}$. Note that the first derivative of the log-likelihood is
$$\frac{\partial \log p_n(\mathcal{X}; \hat{\alpha})}{\partial \hat{\alpha}_i} = \frac{\partial p_n(\mathcal{X}; \hat{\alpha})}{p_n(\mathcal{X}; \hat{\alpha})\,\partial \hat{\alpha}_i}, \qquad (4.38)$$
The second derivative is
$$H_{i,j}\log p_n(\mathcal{X}; \hat{\alpha}) = \frac{H_{i,j}\, p_n(\mathcal{X}; \hat{\alpha})}{p_n(\mathcal{X}; \hat{\alpha})} - \frac{\partial p_n(\mathcal{X}; \hat{\alpha})}{p_n(\mathcal{X}; \hat{\alpha})\,\partial \hat{\alpha}_i}\,\frac{\partial p_n(\mathcal{X}; \hat{\alpha})}{p_n(\mathcal{X}; \hat{\alpha})\,\partial \hat{\alpha}_j}. \qquad (4.39)$$
Considering that
$$E_{\mathcal{X}}\Big(\frac{H_{i,j}\, p_n(\mathcal{X}; \hat{\alpha})}{p_n(\mathcal{X}; \hat{\alpha})}\Big) = \int \frac{H_{i,j}\, p_n(\mathcal{X}; \hat{\alpha})}{p_n(\mathcal{X}; \hat{\alpha})}\, p_n(\mathcal{X}; \hat{\alpha})\, d\mathcal{X} = \int H_{i,j}\, p_n(\mathcal{X}; \hat{\alpha})\, d\mathcal{X} = 0, \qquad (4.40)$$
we take the expectation of the second derivative and then obtain the following.
$$E_{\mathcal{X}}\big(H_{i,j}\log p_n(\mathcal{X}; \hat{\alpha})\big) = -E_{\mathcal{X}}\Big\{\frac{\partial p_n(\mathcal{X}; \hat{\alpha})}{p_n(\mathcal{X}; \hat{\alpha})\,\partial \hat{\alpha}_i}\,\frac{\partial p_n(\mathcal{X}; \hat{\alpha})}{p_n(\mathcal{X}; \hat{\alpha})\,\partial \hat{\alpha}_j}\Big\} = -E_{\mathcal{X}}\Big\{\frac{\partial \log p_n(\mathcal{X}; \hat{\alpha})}{\partial \hat{\alpha}_i}\,\frac{\partial \log p_n(\mathcal{X}; \hat{\alpha})}{\partial \hat{\alpha}_j}\Big\}. \qquad (4.41)$$
Thus, an equivalent substitution for the Hessian matrix $H_{\tilde{f}_b}(\hat{\alpha})$ in Eq. 4.32 is the product of two first-order derivatives. This concludes the proof that we can use the covariance of the gradients to represent the Hessian matrix for efficient computation.
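The practical consequence of Eqs. 4.36–4.41 is that the expensive Hessian of the log-likelihood can be replaced by an outer product of first-order gradients. A minimal numerical check on a toy categorical distribution (assumed purely for illustration, with the expectation taken over the model's own class distribution) is sketched below.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)            # logits of a toy categorical model
probs = torch.softmax(theta, dim=0).detach()           # p_n, treated as a fixed distribution

# Fisher information as the expected outer product of first derivatives (cf. Eq. 4.36).
fisher = torch.zeros(5, 5)
for n in range(5):
    grad_n = torch.autograd.grad(torch.log_softmax(theta, dim=0)[n], theta,
                                 retain_graph=True)[0]
    fisher += probs[n] * torch.outer(grad_n, grad_n)

# Negative expected Hessian of the log-likelihood (cf. Eq. 4.37), via autograd.
def neg_expected_log_lik(t):
    return -(probs * torch.log_softmax(t, dim=0)).sum()

hessian = torch.autograd.functional.hessian(neg_expected_log_lik, theta.detach())

print(torch.allclose(fisher, hessian, atol=1e-5))      # True: the two quantities coincide
```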
FIGURE 4.14
The loss landscape illustration of supernet. (a) The gradient of current weights with different
α, (b) the vanilla αt+1 with backpropagation, (c) α̃t+1 with the decoupled optimization.
$a^{(j)} = \sum_{i<j} \mathrm{softmax}(\alpha_m^{(i,j)})\big(w^{(i,j)} \otimes a^{(i)}\big)$, where $w^{(i,j)} = [[w_m]] \in \mathbb{R}^{M \times 1}$ and $w_m \in \mathbb{R}^{C_{out} \times C_{in} \times K_m \times K_m}$ denotes the weights of all candidate operations between the $i$-th and $j$-th nodes, and $K_m$ denotes the kernel size of the $m$-th operation. Specifically, for pooling and identity operations, $K_m$ equals the downsample size and the size of the feature map, and $w_m$ equals $1/(K_m \times K_m)$ and $1$, respectively. For each intermediate node, its output $a^{(j)}$ is jointly determined by $\alpha_m^{(i,j)}$ and $w_m^{(i,j)}$, while $a^{(i)}$ is independent of both $\alpha_m^{(i,j)}$ and $w_m^{(i,j)}$. As shown in Figs. 4.14 (a) and (b), with different $\alpha$, the gradient of the corresponding $w$ can vary and is sometimes difficult to optimize, possibly becoming trapped in local minima. However, by decoupling $\alpha$ and $w$, the supernet can escape the local minima and be optimized with better convergence.
Based on the deviation and analysis above, we propose our objective for optimizing the
neural architecture search process
$$\mathop{\arg\min}_{\alpha, w} L(w, \alpha) = \begin{cases} L_{\text{NAS}} + \mathrm{reg}(w), & \text{for the Parent model} \\ L_{\text{DCP-NAS}} + \mathrm{reg}(w), & \text{for the Child model,} \end{cases} \qquad (4.42)$$
and, rewriting $\alpha$ as a column vector $[\alpha_1, \cdots, \alpha_e, \cdots, \alpha_E]^T$ in which each $\alpha_e$ is a row vector, we have
$$\alpha\tilde{L} = \begin{bmatrix} \alpha_1\tilde{g}_1 & \cdots & \alpha_1\tilde{g}_e & \cdots & \alpha_1\tilde{g}_E \\ \vdots & & \vdots & & \vdots \\ \alpha_e\tilde{g}_1 & \cdots & \alpha_e\tilde{g}_e & \cdots & \alpha_e\tilde{g}_E \\ \vdots & & \vdots & & \vdots \\ \alpha_E\tilde{g}_1 & \cdots & \alpha_E\tilde{g}_e & \cdots & \alpha_E\tilde{g}_E \end{bmatrix}_{E\times E}. \qquad (4.47)$$
Combining Eq. 4.46 and Eq. 4.47, the matrix in the trace term of Eq. 4.44 can be written as
$$\Big[\alpha\tilde{L}\Big(\frac{\partial w}{\partial \alpha}\Big)\Big]_{m} = \begin{bmatrix} 0 & \cdots & \alpha_1\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}} & \cdots & 0 \\ \vdots & & \vdots & & \vdots \\ 0 & \cdots & \alpha_e\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}} & \cdots & 0 \\ \vdots & & \vdots & & \vdots \\ 0 & \cdots & \alpha_E\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}} & \cdots & 0 \end{bmatrix}_{E\times M}. \qquad (4.48)$$
$$\mathrm{Tr}\Big[\alpha\tilde{L}\Big(\frac{\partial w}{\partial \alpha}\Big)\Big]_{e} = \sum_{m=1}^{M}\alpha_e\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}}. \qquad (4.49)$$
Noting that in the vanilla propagation process $\alpha^{t+1} = \alpha^t - \eta_1 \frac{\partial L(\alpha^t)}{\partial \alpha^t}$, and combining Eq. 4.49, we have
$$\tilde{\alpha}^{t+1} = \alpha^{t+1} - \eta \begin{bmatrix} \sum_{m=1}^{M}\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}} \\ \vdots \\ \sum_{m=1}^{M}\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}} \\ \vdots \\ \sum_{m=1}^{M}\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}} \end{bmatrix} \odot \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_e \\ \vdots \\ \alpha_E \end{bmatrix} = \alpha^{t+1} + \eta\,\psi^t \odot \alpha^t, \qquad (4.50)$$
where $\odot$ represents the Hadamard product and $\eta = \eta_1\eta_2$. We take $\psi^t = -\big[\sum_{m=1}^{M}\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}}, \cdots, \sum_{m=1}^{M}\sum_{e'=1}^{E}\tilde{g}_{e'}\frac{\partial w_m}{\partial \alpha_{e',m}}\big]^T$. Note that $\frac{\partial w}{\partial \alpha}$ is unsolvable and has no explicit form in NAS, which makes $\psi^t$ unsolvable. Thus, we introduce a learnable parameter $\tilde{\psi}^t$ to approximate $\psi^t$, whose back-propagation is calculated as
$$\tilde{\psi}^{t+1} = \Big|\tilde{\psi}^t - \eta_{\psi}\frac{\partial L}{\partial \tilde{\psi}^t}\Big|. \qquad (4.51)$$
Eq. 4.50 shows that our method is based on a projection function that solves the optimization coupling problem through the learnable parameter $\tilde{\psi}^t$. In this method, we consider the influence of $\alpha^t$ and backtrack the optimized state at the $(t+1)$-th step to form $\tilde{\alpha}^{t+1}$. However, the key question in the optimization is where and when the backtracking should be applied. Thus, we define the update rule as
$$\tilde{\alpha}^{t+1}_{m} = \begin{cases} P(\alpha^{t+1}_{:,m}, \alpha^{t}_{:,m}), & \text{if } \mathrm{ranking}(R(w_m)) > \tau \\ \alpha^{t+1}_{:,m}, & \text{otherwise,} \end{cases} \qquad (4.52)$$
where $P(\alpha^{t+1}_{:,m}, \alpha^{t}_{:,m}) = \alpha^{t+1}_{:,m} + \eta\,\tilde{\psi}^t \odot \alpha^{t}_{:,m}$ and the subscript $\cdot_{m}$ denotes a specific edge. $R(w_m)$ denotes the norm constraint of $w_m$ and is further defined as
where τ denotes the threshold for deciding whether or not to backtrack. We further define
the threshold as follows.
τ= · M (4.54)
used for both the Parent and Child models. When applied to the Child model, $w$ here denotes the reconstructed weights from the binarized weights, that is, $w = \beta \circ b_{\hat{w}}$.
FIGURE 4.15
With different λ and μ, we evaluated the Top-1 accuracies of DCP-NAS-L
on ImageNet.
TABLE 4.3
Search efficiency for different search strategies on ImageNet, including previous NAS in both the real-valued and 1-bit search spaces, random search, and our DCP-NAS.

Category         Method          T.P.  GGN  D.O.  Top-1 Acc.  Search Cost
Real-valued NAS  PNAS            -     -    -     74.2        225
                 DARTS           -     -    -     73.1        4
                 PC-DARTS        -     -    -     75.8        3.8
Direct BNAS      BNAS1           -     -    -     64.3        2.6
                 BNAS2-H         -     -    -     63.5        -
                 Random Search   -     -    -     51.3        4.4
Auxiliary BNAS   CP-NAS          -     -    -     66.5        2.8
                 DCP-NAS-L       ✓     -    -     71.4        27.9
                 DCP-NAS-L       ✓     ✓    -     71.2        2.9
                 DCP-NAS-L       ✓     -    ✓     72.6        27.9
                 DCP-NAS-L       ✓     ✓    ✓     72.4        2.9

Note: T.P. and D.O. denote Tangent Propagation and Decoupled Optimization, respectively.
the tangent direction constraint and the reconstruction error can improve the accuracy on
ImageNet. When applied together, the Top-1 accuracy reaches the highest value of 72.4%.
Then we conduct experiments with various values of λ and μ, as shown in Fig. 4.15. We observe that, with a fixed value of μ, the Top-1 accuracy first increases with increasing λ, but decreases when λ is greater than 1e-3. When λ becomes larger, DCP-NAS tends to select the binary architecture whose gradient is similar to that of its real-valued counterpart. To some extent, the 1-bit model's accuracy is neglected, leading to a performance drop. Another aspect of the performance variation is that the Top-1 accuracy first increases and then decreases with increasing μ while λ is held fixed. Paying too much attention to minimizing the distance between the 1-bit parameters and their real-valued counterparts may collapse the representation ability of 1-bit models and severely degrade the performance of DCP-NAS.
To better understand the acceleration rate of applying the Generalized Gauss-Newton
(GGN) matrix in the search process, we conducted experiments to examine the search cost
with and without GGN. As shown in Table 4.3, we compare the searching efficiency and the
accuracy of the architecture obtained by Random Search (random selection), Real-valued
NAS methods, Binarized NAS methods, CP-NAS, DCP-NAS without GGN method, and
DCP-NAS with GGN applied. In random search, the 1-bit supernet randomly samples and trains an architecture in each epoch, then assigns the expectation of all performances to each corresponding edge and operation, and returns the architecture with the highest score; this lacks the necessary guidance in the search process and therefore performs poorly for binary architecture search. Notably, our DCP-NAS without GGN is computationally expensive because of the second-order gradient that must be computed in tangent propagation. Note that directly optimizing two supernets is computationally redundant. However, introducing GGN for the Hessian matrix significantly accelerates the search process, reducing the search cost to almost 10% with a negligible accuracy variation. As shown in Table 4.3, with the use of GGN, our method reduces the search cost from 29 to 2.9, which is more efficient than DARTS. Additionally, our DCP-NAS achieves a
TABLE 4.4
Comparison on the ImageNet dataset of DCP-NAS with different distance calculation methods used to constrain the gradient of binary NAS along the tangent direction, i.e., Eq. 4.31. We use the small model, DCP-NAS-S, to evaluate the searched architecture.

Method             Top-1 Acc. (%)  Top-5 Acc. (%)  Memory (MBits)  Search Cost
Cosine similarity  62.5            83.9            4.2             2.9
L1-norm            62.7            84.3            4.3             2.9
F-norm             63.0            84.5            4.2             2.9
much smaller performance gap with respect to real-valued NAS, at a clearly lower search cost. We conduct ablative experiments with different architecture-discrepancy calculation methods to further clarify tangent propagation. As shown in Table 4.4, the F-norm applied in Eq. 4.31 achieves the best performance, while the cosine similarity and the L1-norm are not as effective as the F-norm.
5
Applications in Natural Language
Processing
5.1 Background
We first overview the background of three aspects of this section: quantization-aware train-
ing for the low-bit language model, post-training quantization for the low-bit language
model, and binary language model.
take nearly four times longer than FP model training. The slow training time undoubtedly hinders the practical adoption of industrial language models. Second, conducting QAT on memory-limited devices is sometimes prohibitive due to the increasing size of large language models.
As demonstrated in [5], the QAT method [285] even consumes 8.3 GB more memory than
FP when trained with knowledge distillation. On the contrary, PTQ methods can conduct
quantization by only caching the intermediate results of each layer, which can be fed into
memory-limited training devices. Third, the training set is sometimes inaccessible due to
industry data security or privacy issues. In contrast, PTQ constructs the small calibration
set by sampling only 1K ∼ 4K instances from the whole training set.
In summary, PTQ is an appealing, efficient alternative in training time, memory over-
head, and data consumption. Generally, instead of the whole training set, PTQ methods
leverage only a small portion of training data to minimize the layer-wise reconstruction error
incurred by quantization [101, 179, 180]. The layer-wise objective breaks down the end-to-
end training, solving the quantization optimization problem in a more sample-efficient [297]
and memory-saving way. Nonetheless, it is non-trivial to directly apply previous PTQ meth-
ods for language models such as BERT [54], as the performance drops sharply. For this
reason, several efforts have been devoted to improving the performance.
derivative in the backward propagation. In detail, for the weights of binarized linear layers, the common practice is to redistribute the weights to zero mean to retain representation information [199] and to apply scaling factors to minimize quantization errors [199]. The activations are binarized by the sign function without re-scaling for computational efficiency. Thus, the computation can be expressed as
computation can be expressed as
Q = bi-linearQ (H),
K = bi-linearK (H), (5.4)
V = bi-linearV (H),
where bi-linearQ , bi-linearK , and bi-linearV represent three different binarized linear layers
for Q, K, and V, respectively. Then the attention score A is computed as follows:
$$A = \frac{1}{\sqrt{D}}\, B_Q \otimes B_K, \quad B_Q = \mathrm{sign}(Q), \; B_K = \mathrm{sign}(K), \qquad (5.5)$$
where BQ and BK are the binarized query and key, respectively. Note that the obtained
attention weight is then truncated by attention mask, and each row in A can be regarded
as a k-dim vector, where k is the number of unmasked elements. Then attention weights
BsA are binarized as
BsA = sign(softmax(A)). (5.6)
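The following sketch walks through Eqs. 5.4–5.6 for a single head: binarized linear layers produce Q, K, and V, the query and key are binarized with the sign function, and the attention scores are binarized again after the softmax. The simple `bi_linear` function below (zero-mean redistribution, sign binarization, per-row scaling) is an illustrative stand-in for the binarized linear layers described above, not their exact implementation.

```python
import math
import torch

def bi_linear(x: torch.Tensor, w: torch.Tensor):
    """Illustrative binarized linear layer: zero-mean redistribution,
    sign binarization, and a per-output-row scaling factor."""
    w_c = w - w.mean(dim=1, keepdim=True)          # redistribute weights to zero mean
    alpha = w_c.abs().mean(dim=1, keepdim=True)    # scaling factor per output row
    b_w = torch.sign(w_c)
    return torch.sign(x) @ (alpha * b_w).t()       # binary activation, scaled binary weight

D = 64
H = torch.randn(16, D)                             # token representations
W_q, W_k, W_v = (torch.randn(D, D) for _ in range(3))

Q, K, V = bi_linear(H, W_q), bi_linear(H, W_k), bi_linear(H, W_v)   # Eq. 5.4
B_q, B_k = torch.sign(Q), torch.sign(K)
A = (B_q @ B_k.t()) / math.sqrt(D)                 # Eq. 5.5
B_sA = torch.sign(torch.softmax(A, dim=-1))        # Eq. 5.6 (all +1: the degradation revisited by BiBERT)
```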
Despite the appealing properties of network binarization for relieving the massive pa-
rameters and FLOPs, it is technically hard from an optimization perspective for BERT
binarization. As illustrated in Fig. 5.1, the performance for quantized BERT drops mildly
from 32-bit to as low as 2-bit, i.e., around 0.6% ↓ on MRPC and 0.2% ↓ on MNLI-m of
the GLUE benchmark [230]. However, when reducing the bit-width to one, the performance
drops sharply, i.e., ∼ 3.8% ↓ and ∼ 0.9% ↓ on the two tasks. In summary, binarization
of BERT brings severe performance degradation compared with other weight bit-widths.
Therefore, BERT binarization remains a challenging yet valuable task for academia and in-
dustries. This section surveys existing works and advances for binarizing BERT pre-trained
models.
FIGURE 5.1
Performance of quantized BERT with varying weight bit-widths and 8-bit activation on
MRPC and MNLI-m.
FIGURE 5.2
Fully quantized transformer.
estimates computed during training. For every forward pass, xmin and xmax variables are
updated via an exponential moving average with a momentum of 0.9.
During backpropagation, the straight-through estimator [37] is used to bypass the non-differentiable round function, and the gradients of clamped values are set to zero.
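A hedged sketch of the range tracking and fake quantization just described: running estimates of x_min and x_max are updated with an exponential moving average (momentum 0.9), the tensor is uniformly quantized to b bits, and the straight-through estimator passes gradients through the rounding while zeroing them for clamped values. Exact details (per-tensor vs. per-channel ranges, symmetric vs. asymmetric schemes) vary by implementation and are assumptions here.

```python
import torch

class EmaFakeQuant(torch.nn.Module):
    """Uniform fake quantization with EMA range tracking and an STE backward."""

    def __init__(self, bits: int = 8, momentum: float = 0.9):
        super().__init__()
        self.bits, self.momentum = bits, momentum
        self.register_buffer("x_min", torch.tensor(0.0))
        self.register_buffer("x_max", torch.tensor(0.0))

    def forward(self, x: torch.Tensor):
        if self.training:  # update running range estimates (momentum 0.9)
            self.x_min.mul_(self.momentum).add_((1 - self.momentum) * x.min().detach())
            self.x_max.mul_(self.momentum).add_((1 - self.momentum) * x.max().detach())
        scale = (self.x_max - self.x_min).clamp(min=1e-8) / (2 ** self.bits - 1)
        x_clamped = torch.clamp(x, self.x_min.item(), self.x_max.item())
        q = torch.round((x_clamped - self.x_min) / scale) * scale + self.x_min
        # STE: forward uses q, backward uses the gradient of x_clamped,
        # so gradients of values outside [x_min, x_max] are zero.
        return x_clamped + (q - x_clamped).detach()

quant = EmaFakeQuant(bits=8)
x = torch.randn(4, 16, requires_grad=True)
y = quant(x)
y.sum().backward()
```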
FIGURE 5.3
(a) Feed-forward Networks. (b) Scaled Dot-Product Attention. (c) Multi-Head Self-
Attention.
second or higher-dimension tensors. For all other operations, such as sums, the computa-
tional cost added by the quantization operation outweighs the benefit of operating with
reduced precision. As a result, they do not quantize such operations. More precisely, all
weights of the Transformer are quantized, excluding biases, due to the biases being summed
with the INT32 output of matrix multiplications, which provide no additional computational
efficiency from being quantized. Furthermore, the memory space of biases is insignificant
compared to the weight matrices. The biases only represent less than 0.1% of total weights.
As for positional embeddings, the authors quantized the embeddings once before training
due to the fixed positional embeddings. The γ weights of LayerNorms are also quantized.
For activations, the authors quantize the sum of the input embeddings with the positional
encodings in both the encoder and decoder. The (Q, K, V) matrices within the multi-head self-attention are quantized. Also, the softmax's numerator, the softmax's denominator, the softmax's output, and the scaled dot-product attention's output are quantized, as shown in Fig. 5.3(b) and Fig. 5.3(c). At the inference stage, the authors replace the softmax with the exponential function so that the full-precision exponential can be cast into a low-bit format. For the position-wise feed-forward networks, they quantize the output of the ReLUs and the feed-forward outputs themselves, as shown in Fig. 5.3(a). Finally, for all LayerNorms, they quantize the numerator $x - \mu$, the denominator $\sqrt{\sigma^2 + \epsilon}$, their quotient, and the output of the LayerNorm.
for all weight matrices. For activations, they use tensor bucketing for the following ten-
sors: the sum of input embeddings with the positional encoding, the Q, K, V inputs, the
scaled dot-product attention’s output, the feed-forward’s output, the LayerNorm’s numer-
ator, quotient, and output.
where λ_i is the distribution of the top eigenvalues of the Hessian of layer i, calculated with 10% of the training dataset. After Ω_i is computed, they sort the values in descending order and use them as a metric to determine the relative quantization precision. Then quantization-aware finetuning is performed based on the selected precision setting. The eigenvalue distributions for various datasets are provided in Fig. 5.5.
FIGURE 5.4
The overview of the algorithm proposed in [118].
FIGURE 5.5
Top eigenvalue distributions for different encoder layers for various datasets including SST-
2, MNLI, CoNLL-03, and SQuAD. The middle layers generally have higher mean values
and larger variance than the others. The last three layers have the smallest variance and
mean values among all layers.
FIGURE 5.6
The overview of group-wise quantization method proposed in [209]. Here Nh (number of
heads) value matrices Wv are concatenated together, resulting in a 3-d tensor. The same
color denotes the same group with a shared quantization range.
TABLE 5.2
Quantization results for BERT-base on SST-2. Results are obtained with 128 groups in
each layer.
Method w-bits e-bits Acc Size Size-w/o-e
Baseline 32 32 93.00 415.4 324.5
Q-BERT 8 8 92.88 103.9 81.2
DirectQ 4 8 85.67 63.4 40.6
Q-BERT 4 8 92.66 63.4 40.6
DirectQ 3 8 82.86 53.2 30.5
Q-BERT 3 8 92.54 53.2 30.5
Q-BERT(MP) 2/4(MP) 8 92.55 53.2 30.5
DirectQ 2 8 80.62 43.1 20.4
Q-BERT 2 8 84.63 43.1 20.4
Q-BERT(MP) 2/3(MP) 8 92.08 48.1 25.4
Note: The quantization bits used for weights is abbreviated as “w-bits,” embedding as
“e-bits,” model size in MB as “Size,” and model size without embedding layer in MB as
“Size-w/o-e.” For simplicity and efficacy, all models except the Baseline use 8-bit activations. Here “MP” refers to mixed-precision quantization.
(number of heads) value matrices Wv are concatenated together, resulting in a 3-d tensor.
For layer-wise quantization, as shown in Fig. 5.6(a), the entire 3-d tensor will be quantized
into the same range of discrete numbers. A special case of group-wise quantization is that
each dense matrix is a group, and every matrix can have its own quantization range as
shown in Fig. 5.6(b). A more general case, shown in Fig. 5.6(c), partitions each dense matrix with respect to the output neurons, and every $\frac{d}{2N_h}$ consecutive output neurons are bucketed as a group.
The results of Q-BERT on the development set of SST-2 are presented in Table 5.2. SST-2 is a movie review dataset with binary annotations, where the binary label indicates positive or negative reviews. It can be seen that Q-BERT outperforms the direct quantization baseline (DirectQ) by a large margin at various bit precisions.
computation of GELU and Softmax, a known algorithm [49] for integer calculation of
square root is utilized to perform integer-only computation for LayerNorm. Finally, an
integral framework is introduced by exploiting these approximations of GELU, Softmax,
and LayerNorm. The illustration of I-BERT is presented in the right side of Fig. 5.4.
TABLE 5.3
I-BERT quantization result for RoBERTa-Base and RoBERTa-Large on the
development set of the GLUE benchmark. Baseline is trained from the pre-trained
models, and I-BERT is quantized and fine-tuned from the baseline.
RoBERTa-Base
Method Precision MNLI-m MNLI-mm QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg.
Baseline FP32 87.8 87.4 90.4 92.8 94.6 61.2 91.1 90.9 78.0 86.0
I-BERT INT8 87.5 87.4 90.2 92.8 95.2 62.5 90.8 91.1 79.4 86.3
Diff  -0.3  0.0  -0.2  0.0  +0.6  +1.3  -0.3  +0.2  +1.4  +0.3
RoBERTa-Large
Method Precision MNLI-m MNLI-mm QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg.
Baseline FP32 90.0 89.9 92.8 94.1 96.3 68.0 92.2 91.8 86.3 89.0
I-BERT INT8 90.4 90.3 93.0 94.5 96.4 69.0 92.2 93.0 87.0 89.5
Diff +0.4 +0.4 +0.2 +0.4 +0.1 +1.0 0.0 +1.2 +0.7 +0.5
FIGURE 5.7
The overview of algorithm proposed in [5].
In summary, this paper’s contributions are as follows: (1) new kernels for the efficient
and accurate integer-only GELU and Softmax. That is, the GELU and Softmax are approx-
imated with lightweight second-order polynomials, which can be evaluated with integer-only
arithmetic; (2) integer-only LayerNorm computation by leveraging a known algorithm for
integer calculation of square root [49]; and (3) a total integer-only quantization for language
models by utilizing the proposed approximations of GELU, Softmax, and LayerNorm.
$$L_n = \sum_{i=j}^{j+p-1} \big\|\hat{f}_{l_i} - f_{l_i}\big\|^2. \qquad (5.10)$$
The learnable weights and quantization parameters in the n-th module are updated by
minimizing the reconstruction errors. The proposed MREM can be optimized in parallel: given previously trained modules, only the weights and quantization parameters in the current module are updated. Moreover, the number of modules N can be adjusted depending on the memory constraints of the computing resources. The flexibility in the number of transformer layers per module ensures that a proper trade-off between layer-wise correlation and the memory overhead of training devices can be achieved. Although a similar block-wise objective was previously proposed in [137], it requires calculating second-order Hessian matrices for optimization, which can be computationally prohibitive for large language models.
The hyperparameter λ controls the strength of teacher forcing. λ = 1 gives the full cor-
rection of reconstruction error but with forward inconsistency, e.g., the connection between
the current module and previous quantized modules is broken. While λ = 0 reduces for-
ward inconsistency, it suffers from the propagated reconstruction error. To achieve a good
trade-off between reconstruction error reduction and forward inconsistency elimination, a
linear decay strategy for λ is proposed:
$$\lambda_t = \max\Big(1 - \frac{t}{T_0}, 0\Big), \qquad (5.12)$$
where T0 is the preset maximum steps of the decay. In the beginning, a large λ is desired
since each module is rarely optimized. Later, a small λ is preferred to transit to normal
training such that the forward inconsistency can be bridged. The remaining T − T0 steps
stick to normal training so that each quantized module adapts to its own predecessors.
The comparison between the proposed method and other existing state-of-the-art BERT quantization methods is presented in Table 5.4. From Table 5.4, both the proposed MREM-S and MREM-P outperform existing PTQ approaches in most cases, and even achieve results close to QAT approaches. For example, the “W4-E4-A8” quantized MREM-S and MREM-P achieve average accuracies of 83.5% and 83.4% on MNLI, respectively, which are on par with the “W2/4-E8-A8” quantized Q-BERT. In terms of the “W2-E2-A8” quantized models, our MREM-S and MREM-P surpass GOBO by 11.7% ↑ and 11.3% ↑ on MNLI-m, respectively.
In summary, this paper’s contributions are as follows: (1) module-wise reconstruction
error minimization (MREM) that is a fast, memory-saving, and data-efficient approach
to improve the post-training quantization for language models; (2) a new model parallel
strategy based on MREM to accelerate post-training quantization with theoretical speed-
up for distributed training; and (3) annealed teacher forcing to alleviate the propagation of
reconstruction error and boost the performance.
TABLE 5.4
Results on the GLUE development set. “MREM-S” denotes sequential optimization.

Method       #Bits (W-E-A)  Size  PTQ  MNLI-m  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Avg.
-            full-prec.     418   -    84.9    91.4  92.1  93.2   59.7  90.1   86.3  72.2  83.9
Q-BERT       2-8-8          43    -    76.6    -     -     84.6   -     -      -     -     -
Q-BERT       2/4-8-8        53    -    83.5    -     -     92.6   -     -      -     -     -
Quant-Noise  PQ             38    -    83.6    -     -     -      -     -      -     -     -
TernaryBERT  2-2-8          28    -    83.3    90.1  91.1  92.8   55.7  87.9   87.5  72.9  82.7
GOBO         3-4-32         43    ✓    83.7    -     -     -      -     88.3   -     -     -
GOBO         2-2-32         28    ✓    71.0    -     -     -      -     82.7   -     -     -
MREM-S       4-4-8          50    ✓    83.5    90.2  91.2  91.4   55.1  89.1   84.8  71.8  82.4
MREM-S       2-2-8          28    ✓    82.7    89.6  90.3  91.2   52.3  88.7   86.0  71.1  81.5
MREM-P       4-4-8          50    ✓    83.4    90.2  91.0  91.5   54.7  89.1   86.3  71.1  82.2
MREM-P       2-2-8          28    ✓    82.3    89.4  90.3  91.3   52.9  88.3   85.8  72.9  81.6

Note: “MREM-P” denotes parallel optimization. “Size” refers to model storage in MB. “PTQ” indicates whether the method belongs to post-training quantization. “Avg.” denotes the average results of all tasks.
5.6.1 Analysis
Specifically, the analysis presents two findings: (1) the scaling parameter in LayerNorm
amplifies the outliers from embedding dimensions and (2) when clipping the outliers and
evaluating the final performance, the importance of outliers is highly varied. For the first
finding, the scaling parameter γ in the LayerNorm structure works as an outlier amplifier,
which amplifies the outliers in the output. For token t at j-th embedding dimension, the
LayerNorm is defined as follows:
$$\tilde{X}_{t,j} = \frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} \cdot \gamma_j + \beta_j, \qquad (5.13)$$
where μ_t and σ_t² are the mean and variance of token t, respectively. By inspecting the LayerNorm formula, the multiplier γ plays a crucial role in amplifying the magnitude of token t, as shown in Fig. 5.8. Thus, they propose to remove the amplification effect by extracting γ from Eq. (5.13) and using the Non-scaling LayerNorm of Eq. (5.14):
$$X'_{t,j} = \frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} + \frac{\beta_j}{\gamma_j}, \qquad (5.14)$$
Since the magnitude of token t is shortened by extracting γ, the resulting X' is more quantization-friendly than X̃.
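A minimal sketch contrasting the standard LayerNorm of Eq. 5.13 with the Non-scaling LayerNorm of Eq. 5.14. In the full gamma-migration scheme the extracted γ is folded into the weights of the following layer; that step is omitted here, and only the normalization itself is shown under assumed toy shapes.

```python
import torch

def layernorm(x, gamma, beta, eps=1e-12):
    """Standard LayerNorm over the embedding dimension (Eq. 5.13)."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

def non_scaling_layernorm(x, gamma, beta, eps=1e-12):
    """Non-scaling LayerNorm (Eq. 5.14): gamma is excluded so it no longer
    amplifies outlier embedding dimensions; in gamma migration it is folded
    into the next layer's weights (not shown here)."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps) + beta / gamma

d = 768
x = torch.randn(4, 16, d)
gamma, beta = torch.rand(d) + 0.5, torch.randn(d)
print(layernorm(x, gamma, beta).abs().max(), non_scaling_layernorm(x, gamma, beta).abs().max())
```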
For the second finding, they discover that the influence of clipping the outliers on the final performance varies greatly. In particular, when clipping the outliers and evaluating the final performance, they find that the importance of outliers is highly varied. Take the outliers after GELU as an example: Fig. 5.9 shows that sharply clipping the more aggressive outliers (clipping signals in the 10–100 range down to 10) does not even hurt the full-precision performance, with accuracy still at 91.02. At the same time, the accuracy drops suddenly to 85.93 when too many outliers are cut. In addition, though those less important outliers may appear in a long-tail form, they are produced by only a few tokens. In particular, unimportant outliers that can be clipped without any accuracy drop in FP models correspond to only a few tokens. From the red points in Fig. 5.9, which represent the proportion of clipped tokens, it can be clearly seen that the more aggressive outliers, though occupying a large range from 10 to 100, correspond to only about 3% of tokens. Removing those sharper outliers belonging to a few tokens does not affect the performance.
FIGURE 5.8
Presentation of outliers over X̃, γ and X of LayerNorm on BERT-SST-2. For example, at
dimension 308, γ and X̃ both have sharper values. By excluding γ, it can be seen that X holds a milder distribution than X̃.
FIGURE 5.9
The distribution boundary (mean + 3 × std) is drawn as the left border; the clipping value for the tensor is then enumerated on RoBERTa-QNLI. The red points reflect the proportion of clipped tokens.
FIGURE 5.10
Left: quantization flow before. Right: gamma migration.
where O_u denotes the collection of upper bounds and O_l the collection of lower bounds. The clipping values are determined by:
cu = quantile(Ou , α),
(5.16)
cl = quantile(Ol , α),
where quantile(·, α) computes the α-quantile of its input. The α that minimizes the final loss is found by grid search. The authors choose a uniform quantizer; thus, according to $c_u$ and $c_l$, the step size $s_0$ of the uniform quantizer can be computed for bit-width $b$ as $s_0 = \frac{c_u - c_l}{2^b - 1}$.
At the fine-grained stage, the preliminary clipping range is optimized to obtain a better result. The aim is to make fine-grained adjustments in the critical area to further guarantee the final effect. In detail, the step size s0 from the coarse-grained stage is adopted for initialization. Then gradient descent is used to update the step size s toward the final loss with learning rate η:
$$s = s - \eta\frac{\partial L}{\partial s}. \qquad (5.17)$$
Due to the wide range of outliers only corresponding to a few tokens, passing through
the unimportant area from the token perspective (the coarse-grained stage) needs much
fewer iterations than from the value perspective (the fine-grained stage). The special design
of the two stages adequately exploits this feature and thus leads to high efficiency.
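The coarse-to-fine calibration just described can be sketched as follows: the clipping range is first picked from token-level statistics via the α-quantile (Eq. 5.16), the initial step size s0 is derived from it, and then s is refined by gradient descent (Eq. 5.17) with a straight-through estimator on the rounding. The toy statistics and the placeholder loss are assumptions for illustration only; how the quantile is oriented for the lower bound may differ in the actual implementation.

```python
import torch

def coarse_clipping_range(outlier_upper, outlier_lower, alpha, bits):
    """Coarse-grained stage: clipping range from the alpha-quantile (Eq. 5.16)
    and the initial step size s0 of a uniform quantizer."""
    c_u = torch.quantile(outlier_upper, alpha)
    c_l = torch.quantile(outlier_lower, alpha)
    s0 = (c_u - c_l) / (2 ** bits - 1)
    return c_l, c_u, s0

def fake_quantize(x, s, c_l, bits):
    """Uniform quantizer with learnable step size s. The round() is bypassed
    with a straight-through estimator so that s can be refined by gradient
    descent in the fine-grained stage (Eq. 5.17)."""
    q = torch.clamp((x - c_l) / s, 0, 2 ** bits - 1)
    q = q + (torch.round(q) - q).detach()       # STE on the rounding only
    return q * s + c_l

# Assumed toy statistics: per-token maxima/minima collected on a calibration set.
O_u, O_l = torch.rand(512) * 100, -torch.rand(512) * 100
c_l, c_u, s0 = coarse_clipping_range(O_u, O_l, alpha=0.99, bits=8)

s = torch.nn.Parameter(s0.detach().clone())      # fine-grained stage: learn s
x = torch.randn(128, 768) * 10
loss = fake_quantize(x, s, c_l, bits=8).pow(2).mean()   # placeholder final loss
loss.backward()                                  # gradient w.r.t. the step size s
```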
FIGURE 5.11
Loss landscapes visualization of the full-precision, ternary, and binary models on
MRPC [230].
where x ∈ {±0.2W̄1, ±0.4W̄1, ..., ±1.0W̄1} are perturbation magnitudes based on the absolute mean value W̄1 of W1, and similar rules hold for y. 1x and 1y are vectors with all elements equal to 1. For each pair (x, y), the corresponding training loss is shown in Fig. 5.11. As can be seen, the full-precision model has the lowest overall training loss, and its loss landscape is flat and robust to the perturbation. For the ternary model, although the surface tilts up with larger perturbations, it looks locally convex and is thus easy to optimize. This may also explain why BERT models can be ternarized without a severe accuracy drop [285]. However, the loss landscape of the binary model turns out to be higher and more complex. Stacking the three landscapes together, the loss surface of the binary BERT stands on top with a clear margin over the other two. The steep curvature of the loss surface reflects a higher sensitivity to binarization, which contributes to the training difficulty.
The authors further quantitatively measured the steepness of the loss landscape, starting from a local minimum W and applying a second-order approximation to the curvature. According to the Taylor expansion, the loss increase induced by quantizing W can be approximately upper bounded by a term governed by the quantization noise W − Ŵ and λmax, the largest eigenvalue of the Hessian H at W. Note that the first-order term is skipped because ∇L(W) = 0. By taking
λmax [208] as a quantitative measurement for the steepness of the loss surface, the authors
separately calculated λmax for each part of BERT as (1) the query/key layers (MHA-QK),
(2) the value layer (MHA-V), (3) the output projection layer (MHA-O) in the multi-head
attention, (4) the intermediate layer (FFN-Mid), and (5) the output layer (FFN-Out) in the
feed-forward network. From Fig. 5.12, the top-1 eigenvalues of the binary model are higher
FIGURE 5.12
The top-1 eigenvalues of parameters at different Transformer parts of the full-precision (FP),
ternary, and binary BERT.
in both expectation and standard deviation compared to the full-precision baseline and the ternary model. For instance, the top-1 eigenvalues of MHA-O in the binary model are ∼15× larger than those of the full-precision counterpart. Therefore, the quantization loss increases of the full-precision and ternary models are more tightly bounded than that of the binary model in Eq. (5.19). The highly complex and irregular landscape caused by binarization thus poses more challenges to the optimization.
While the solution to Eq. (5.20) is not unique, the latent full-precision weights $W_1^b$, $W_2^b$ are constrained after splitting to satisfy $W^t = W_1^b + W_2^b$ as
$$W_{1,i}^b = \begin{cases} a \cdot W_i^t & \text{if } \hat{W}_i^t \neq 0 \\ b + W_i^t & \text{if } \hat{W}_i^t = 0,\ W_i^t > 0 \\ b & \text{otherwise,} \end{cases} \qquad (5.21)$$
$$W_{2,i}^b = \begin{cases} (1-a) \cdot W_i^t & \text{if } \hat{W}_i^t \neq 0 \\ -b & \text{if } \hat{W}_i^t = 0,\ W_i^t > 0 \\ -b + W_i^t & \text{otherwise,} \end{cases} \qquad (5.22)$$
where a and b are the variables to solve for. From Eq. (5.21) and Eq. (5.22) with $\hat{W}^t = \hat{W}_1^b + \hat{W}_2^b$, we get
$$a = \frac{\sum_{i \in \mathcal{I}} |W_i^t| + \sum_{j \in \mathcal{J}} |W_j^t| - \sum_{k \in \mathcal{K}} |W_k^t|}{2\sum_{i \in \mathcal{I}} |W_i^t|}, \qquad
b = \frac{\frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} |W_i^t| - \frac{1}{n}\sum_{i=1}^{n} |W_i^t|}{2(|\mathcal{J}| + |\mathcal{K}|)}, \qquad (5.23)$$
where we denote $\mathcal{I} = \{i \mid \hat{W}_i^t \neq 0\}$, $\mathcal{J} = \{j \mid \hat{W}_j^t = 0 \text{ and } W_j^t > 0\}$, and $\mathcal{K} = \{k \mid \hat{W}_k^t = 0 \text{ and } W_k^t < 0\}$. $|\cdot|$ denotes the cardinality of a set.
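A minimal sketch of ternary weight splitting under the formulas above, on a toy 1-D weight vector. The thresholding rule used to produce the ternarized weights is an assumption for illustration; only the splitting itself follows Eqs. 5.21–5.23, and the check verifies that the latent weights sum back to the original.

```python
import torch

def ternary_weight_split(w_t: torch.Tensor, w_hat_t: torch.Tensor):
    """Split a trained ternary weight (latent weights w_t, ternarized values
    w_hat_t) into two binary-network latent weights whose sum recovers w_t
    (Eqs. 5.21-5.23)."""
    I = w_hat_t != 0
    J = (w_hat_t == 0) & (w_t > 0)

    abs_w = w_t.abs()
    a = (abs_w[I].sum() + abs_w[J].sum() - abs_w[~I & ~J].sum()) / (2 * abs_w[I].sum())
    b = (abs_w[I].mean() - abs_w.mean()) / (2 * (~I).sum())     # |J| + |K| non-zero-free entries

    ones = torch.ones_like(w_t)
    w1 = torch.where(I, a * w_t, torch.where(J, b + w_t, b * ones))
    w2 = torch.where(I, (1 - a) * w_t, torch.where(J, -b * ones, -b + w_t))
    return w1, w2

w_t = torch.randn(1000)                                          # latent ternary-model weights
scale = w_t.abs().mean()
w_hat_t = torch.where(w_t.abs() > 0.7 * scale, scale * torch.sign(w_t), torch.zeros_like(w_t))
w1, w2 = ternary_weight_split(w_t, w_hat_t)
print(torch.allclose(w1 + w2, w_t, atol=1e-5))                   # latent weights sum back to w_t
```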
Then, the prediction-layer distillation minimizes the soft cross-entropy (SCE) between
quantized student logits ŷ and teacher logits y, i.e.,
$$\ell_{\mathrm{pred}} = \mathrm{SCE}(\hat{y}, y). \qquad (5.25)$$
After splitting from the half-sized ternary model, the binary model inherits its perfor-
mance on a new architecture with full width. However, the original minimum of the ternary
model may not hold in this new loss landscape after splitting. Thus, the authors further
proposed to fine-tune the binary model with prediction-layer distillation to look for a better
solution.
For implementation, the authors took DynaBERT [89] sub-networks as backbones, offering both half-sized and full-sized models for easy comparison. First, a ternary model of width 0.5× is trained with the two-stage knowledge distillation until convergence. Then the authors split it into a binary model of width 1.0× and performed further fine-tuning with prediction-layer distillation. Table 5.5 compares their proposed BinaryBERT with a variety
prediction-layer distillation. Table 5.5 compares their proposed BinaryBERT with a variety
of state-of-the-art counterparts, including Q-BERT [208], GOBO [279], Quant-Noise [65]
and TernaryBERT [285] for quantizing BERT on MNLI of GLUE [230] and SQuAD [198].
Aside from quantization, other general compression approaches are also compared such
as DistillBERT [206], LayerDrop [64], TinyBERT [106], and ALBERT [126]. BinaryBERT
has the smallest model size with the best performance among all quantization approaches.
Compared with the full-precision model, BinaryBERT retains competitive performance with
significantly reduced model size and computation. For example, it achieves more than 24×
compression ratio compared with BERT-base, with only 0.4% ↓ and 0.0%/0.2% ↓ drop on
MNLI-m and SQuAD v1.1, respectively.
In summary, this paper’s contributions can be concluded as: (1) The first work to explore
BERT binarization with an analysis for the performance drop of binarized BERT models. (2)
A ternary weight-splitting method splits a trained ternary BERT to initialize BinaryBERT,
followed by fine-tuning for further refinement.
FIGURE 5.13
Structure of BinaryBERT-based BEBERT. The dashed lines denoted with A, B, and C
represent combining ensemble with different KD strategies.
Inspired by the empirical observation in [3] that convolutional neural networks gain little accuracy from ensemble learning after KD, the authors removed KD during ensembling to accelerate the training of BEBERT. Although the two-stage KD performs better in [106], it is time-consuming to conduct forward and backward propagation twice. Ensembling with prediction KD avoids the double propagation, and ensembling without KD even removes the evaluation of the teacher model. The authors further conducted experiments to test whether applying KD during ensembling of BinaryBERT has only a minor effect on its accuracy on the GLUE datasets, showing that BEBERT without KD can save training
time while preserving accuracy. They further compared BEBERT to various SOTA com-
pressed BERTs. The results listed in Table 5.6 suggest BEBERT outperforms BinaryBERT
in accuracy by up to 6.7%. Compared to the full-precision BERT, it also saves 15× and 13×
on FLOPs and model size, respectively, with a negligible accuracy loss of 0.3%, showing the
potential for practical deployment.
In summary, this paper’s contributions can be concluded as: (1) The first work that
introduces ensemble learning to binary BERT models to improve accuracy and robustness.
(2) Removing the KD procedures during ensemble accelerates the training process.
5.9.1 Bi-Attention
To address the information degradation of binarized representations in the forward prop-
agation, the authors proposed an efficient Bi-Attention structure based on information
theory, which statistically maximizes the entropy of representation and revives the atten-
tion mechanism in the fully binarized BERT. Since the representations (weight, activation,
and embedding) with extremely compressed bit-width in fully binarized BERT have lim-
FIGURE 5.14
Attention-head view for (a) full-precision BERT, (b) fully binarized BERT baseline, and
(c) BiBERT for same input. BiBERT with Bi-Attention shows similar behavior with the
full-precision model, while baseline suffers indistinguishable attention for information degra-
dation.
ited capabilities, the ideal binarized representation should preserve the given full-precision counterparts as much as possible, which means the mutual information between the binarized and full-precision representations should be maximized. When the deterministic sign function is applied to binarize BERT, the goal is equivalent to maximizing the information entropy H(B) of the binarized representation B [171], which is defined as
$$H(B) = -\sum_{B} p(B)\log p(B), \qquad (5.26)$$
where B ∈ {−1, 1} is the random variable sampled from B with probability mass function
p. Therefore, the information entropy of binarized representation should be maximized to
better preserve the full-precision counterparts and let the attention mechanism function
well.
As for the attention structure in full-precision BERT, the normalized attention weight obtained by the softmax is essential. But directly applying the binarization function causes a complete information loss in the binarized attention weight. Specifically, since softmax(A)
is regarded as following a probability distribution, the elements of BsA are all quantized to
1 (Fig. 5.14(b)) and the information entropy H(BsA ) degenerates to 0. A common measure
to alleviate this information degradation is to shift the distribution of input tensors before
applying the sign function, which is formulated as
B̂sA = sign (softmax(A) − τ ) , (5.27)
where the shift parameter τ , also regarded as the threshold of binarization, is expected to
maximize the entropy of the binarized B̂sA and is fixed during the inference. Moreover, the
attention weight obtained by the sign function is binarized to {−1, 1}, while the original
attention weight has a normalized value range [0, 1]. The negative value of attention weight
in the binarized architecture is contrary to the intuition of the existing attention mechanism
and is also empirically proved to be harmful to the attention structure.
To mitigate the information degradation caused by binarization in the attention mech-
anism, the authors introduced an efficient Bi-Attention structure for fully binarized BERT,
which maximizes information entropy of binarized representations statistically and applies
bitwise operations for fast inference. In detail, they proposed to binarize the attention weight
into the Boolean value, while the design is driven by information entropy maximization. In
Bi-Attention, bool function is leveraged to binarize the attention score A, which is defined
as
1, if x ≥ 0
bool(x) = , (5.28)
0, otherwise
$$\frac{\partial\,\mathrm{bool}(x)}{\partial x} = \begin{cases} 1, & \text{if } |x| \leq 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (5.29)$$
By applying the bool(·) function, the elements in the attention weight with lower values are binarized to 0, and thus the obtained entropy-maximized attention weight can filter out the crucial elements. The proposed Bi-Attention structure is finally expressed as
$$B_A = \mathrm{bool}(A) = \mathrm{bool}\Big(\frac{1}{\sqrt{D}}\, B_Q \otimes B_K\Big), \qquad (5.30)$$
$$\text{Bi-Attention}(B_Q, B_K, B_V) = B_A \boxtimes B_V, \qquad (5.31)$$
where $B_V$ is the binarized value obtained by sign(V), $B_A$ is the binarized attention weight, and $\boxtimes$ is a well-designed Bitwise-Affine Matrix Multiplication (BAMM) operator composed of $\otimes$ and bitshift to align training and inference representations and perform efficient bitwise calculation.
In a nutshell, in the Bi-Attention structure the information entropy of the binarized attention weight is maximized (as Fig. 5.14(c) shows), alleviating the immense information degradation and reviving the attention mechanism. Bi-Attention also achieves greater efficiency since the softmax is excluded.
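A minimal forward-pass sketch of Bi-Attention following Eqs. 5.28–5.31 for a single head. Plain matrix multiplications stand in for the bitwise BAMM operator, and the clipped STE of Eq. 5.29 (used during training) is omitted; the tensor shapes are assumptions for illustration.

```python
import math
import torch

def bool_binarize(x: torch.Tensor):
    """Eq. 5.28: binarize to {0, 1}. In training, the clipped STE of Eq. 5.29
    would pass gradients for |x| <= 1; only the forward pass is sketched here."""
    return (x >= 0).float()

def bi_attention(b_q, b_k, b_v, d):
    """Eqs. 5.30-5.31: entropy-maximized binary attention for one head.
    Standard matmuls stand in for the bitwise BAMM operator."""
    a = (b_q @ b_k.t()) / math.sqrt(d)      # binary query x binary key
    b_a = bool_binarize(a)                  # binarized attention weight in {0, 1}
    return b_a @ b_v                        # weight the binarized values

d = 64
q, k, v = (torch.randn(16, d) for _ in range(3))
b_q, b_k, b_v = torch.sign(q), torch.sign(k), torch.sign(v)
out = bi_attention(b_q, b_k, b_v, d)
print(out.shape)
```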
$$P_Q = \frac{Q \times Q^{\top}}{\|Q \times Q^{\top}\|}, \quad P_K = \frac{K \times K^{\top}}{\|K \times K^{\top}\|}, \quad P_V = \frac{V \times V^{\top}}{\|V \times V^{\top}\|}, \qquad (5.32)$$
where $\|\cdot\|$ denotes $\ell_2$ normalization. The corresponding teacher patterns are constructed in the same way from the teacher's activations. The distillation loss is expressed as:
where L denotes the number of transformer layers and $\mathcal{F}_{\mathrm{DMD}} = \{Q, K, V\}$. The loss term for the hidden states is constructed in the $\ell_2$-normalized form.
The overall pipeline for BiBERT is shown in Fig. 5.15. The authors conducted experi-
ments on the GLUE benchmark for binarizing various BERT-based pre-trained models. The
results listed in Table 5.7 show that BiBERT surpasses BinaryBERT by a wide margin in average accuracy.
FIGURE 5.15
Overview of BiBERT, applying Bi-Attention structure for maximizing representation infor-
mation and Direction-Matching Distillation (DMD) scheme for accurate optimization.
TABLE 5.7
Quantization results of BiBERT on GLUE
benchmark. The average results of all tasks
are reported.
Method #Bits Size GLUE
BERT-base full-prec. 418 82.84
BinaryBERT 1-1-4 16.5 79.9
TernaryBERT 2-2-2 28.0 45.5
BinaryBERT 1-1-2 16.5 53.7
TernaryBERT 2-2-1 28.0 42.3
BinaryBERT 1-1-1 16.5 41.0
BiBERT 1-1-1 13.4 63.2
BERT-base6L full-prec. 257 79.4
BiBERT6L 1-1-1 6.8 62.1
BERT-base4L full-prec. 55.6 77.0
BiBERT4L 1-1-1 4.4 57.7
In summary, this paper’s contributions can be concluded as: (1) The first work to explore
fully binary pre-trained BERT-models. (2) An efficient Bi-Attention structure for maximiz-
ing representation information statistically. (3) A Direction-Matching Distillation (DMD)
scheme to optimize the full binarized BERT accurately.
FIGURE 5.16
Overview of BiT. A transformer block contains the multi-head self-attention and feed-forward network. All weights in the Embedding/Fully-Connected layers are binarized to {−1, 1}, and activations are binarized to {0, 1} for ReLU/Softmax outputs and to {−1, 1} for other layers.
distilling higher precision models into lower precision students. They are introduced in detail
as follows.
In that case, $\hat{X}_B^{T}\hat{X}_B = n_{X_R}$, where $n_{X_R}$ is the number of elements in $X_R$, and $\alpha^*$ can be solved as:
$$\alpha^* = \frac{X_R^{T}\hat{X}_B}{n_{X_R}} = \frac{\|X_R\|_{\ell 1}}{n_{X_R}}. \qquad (5.38)$$
For the activations in attention layers or after the ReLU non-linearity with $X_R \in \mathbb{R}_{+}^{n}$, the authors binarized the activations to $\hat{X}_B \in \{0, 1\}^n$ by rounding the real-valued activations:
$$\hat{X}_B^i = \mathrm{round}\big(\mathrm{Clip}(X_R^i, 0, 1)\big) = \begin{cases} 0, & \text{if } X_R^i < 0.5 \\ 1, & \text{if } X_R^i \geq 0.5. \end{cases} \qquad (5.39)$$
In that case, $\hat{X}_B^{T}\hat{X}_B = n_{\{X_R \geq 0.5\}}$, where $n_{\{X_R \geq 0.5\}}$ denotes the number of elements in $X_R$ that are greater than or equal to 0.5. Then $\alpha^*$ can be solved as:
$$X_B^i = \alpha\,\hat{X}_B^i = \alpha\,\mathrm{round}\Big(\mathrm{Clip}\big(\tfrac{X_R^i - \beta}{\alpha}, 0, 1\big)\Big). \qquad (5.41)$$
In this function, α is initialized with α* from Eq. (5.38) and β with 0; both are trained with gradients from the final loss. To back-propagate the gradients to α through the discretized binarization function, the straight-through estimator (STE) [9] is leveraged to pass the incoming gradients of the round function through as the outgoing gradients:
$$\frac{\partial X_B^i}{\partial \alpha} = \hat{X}_B^i + \alpha\frac{\partial \hat{X}_B^i}{\partial \alpha} \overset{\mathrm{STE}}{\approx} \hat{X}_B^i + \alpha\frac{\partial\,\mathrm{Clip}\big(\frac{X_R^i - \beta}{\alpha}, 0, 1\big)}{\partial \alpha} = \begin{cases} 0, & \text{if } X_R^i < \beta \\ \frac{\beta - X_R^i}{\alpha}, & \text{if } \beta \leq X_R^i < \alpha/2 + \beta \\ 1 - \frac{X_R^i - \beta}{\alpha}, & \text{if } \alpha/2 + \beta \leq X_R^i < \alpha + \beta \\ 1, & \text{if } X_R^i \geq \alpha + \beta \end{cases} \qquad (5.42)$$
For the layers that contain both positive and negative real-valued activations, i.e., $X_R \in \mathbb{R}^n$, the binarized values $\hat{X}_B \in \{-1, 1\}^n$ are indifferent to the scale inside the Sign function: $X_B^i = \alpha \cdot \mathrm{Sign}\big(\frac{X_R^i - \beta}{\alpha}\big) = \alpha \cdot \mathrm{Sign}(X_R^i - \beta)$. In that case, since the effect of the scaling factor α inside the Sign function can be ignored, the gradient w.r.t. α can simply be calculated as $\frac{\partial X_B^i}{\partial \alpha} = \mathrm{Sign}(X_R^i - \beta)$.
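A minimal sketch of the elastic binary activation for non-negative activations (Eqs. 5.39–5.42): a learnable scale α and threshold β, with a straight-through estimator on the rounding so that autograd reproduces the gradient structure of Eq. 5.42. Here α defaults to 1.0 for simplicity; in the method described above it would be initialized from the closed-form α* of Eq. 5.38.

```python
import torch

class ElasticBinarization(torch.nn.Module):
    """Elastic binary activation for non-negative activations (Eqs. 5.39-5.42):
    learnable scale alpha and threshold beta, trained by the final loss,
    with the rounding bypassed by a straight-through estimator."""

    def __init__(self, init_alpha: float = 1.0):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.tensor(init_alpha))
        self.beta = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, x_r: torch.Tensor):
        x_hat = torch.clamp((x_r - self.beta) / self.alpha, 0, 1)
        x_hat = x_hat + (torch.round(x_hat) - x_hat).detach()   # STE on round()
        return self.alpha * x_hat                               # X_B = alpha * X_hat_B

act = ElasticBinarization()
x = torch.relu(torch.randn(8, 128))      # non-negative activations (e.g., after ReLU)
y = act(x)
y.sum().backward()
print(act.alpha.grad, act.beta.grad)     # both receive gradients from the final loss
```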
level. In practice, the authors found that, down to a quantization level of W1A2, one can distill models of reasonable accuracy in a single shot. As a result, they followed a fixed quantization schedule, W32A32 → W1A2 → W1A1. BiT, shown in Fig. 5.16, combines the elastic binary activations with multi-distillation and thereby ensures a good initialization for the eventual student model. Since the binary loss landscape is highly irregular, good initialization is critical to aid optimization.
In summary, this paper’s contributions can be concluded as: (1) The first demonstration
of fully binary pre-trained BERT models with less performance degradation. (2) A two-set binarization scheme, an elastic binary activation function with learned parameters, and a multi-distillation method to boost the performance of binarized BERT models.
1 (a, b) > (c, d) if a > c and b ≥ d or a ≥ c and b > d.
$$E_q = \mathrm{Normalize}\big(\mathrm{CNN}(\mathrm{BERT}(\text{"[Q]}\,q_0 q_1 \cdots q_l\text{"}))\big), \quad E_d = \mathrm{Filter}\big(\mathrm{Normalize}(\mathrm{CNN}(\mathrm{BERT}(\text{"[D]}\,d_0 d_1 \cdots d_n\text{"})))\big), \qquad (5.44)$$
Êq = Eq (I − Pq ). (5.46)
Compared to the unprocessed embedding bag $E_q$, the diffused embedding presents a more balanced spectrum (distribution of singular values) in expectation.
$$B_{q_i} = \frac{\|\hat{E}_{q_i}\|_1}{c} \cdot \mathrm{sign}(\hat{E}_{q_i}), \qquad (5.47)$$
where $i$ indexes the embeddings in $\hat{E}_q$ and c denotes the embedding dimension. The binarized embedding bag $B_q$
sketches the original embeddings via (1) binarized codes and (2) embedding scaler, both
of which collaboratively reveal the value range of original embedding entries. Moreover,
such rescaled binarization supports the bit-wise operations for computation acceleration in
match-scoring prediction. To mitigate this, the authors further utilized the approximation
of Unit Impulse Function [58] to furnish the accordant gradient estimation as:
$$\frac{\partial \mu(t)}{\partial t} = \begin{cases} 1, & t = 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (5.48)$$
The above scheme replaces most of the floating-point arithmetic with bit-wise operations, providing the potential for online computation acceleration. Lastly, Bi-ColBERT adopts the training paradigm of ColBERT, which is optimized via the pairwise softmax cross-entropy loss over the computed scores of positive and negative passage samples.
The proposed Bi-ColBERT is evaluated on the MS-MARCO Ranking dataset [182]. It
is a collection of 8.8M passages from 1M real-world queries to Bing. Each query is associ-
ated with sparse relevance judgments of one (or a small number of) documents marked as
relevant and no documents explicitly marked as irrelevant. The results listed in Table 5.8
suggest a trade-off between passage searching quality and retrieval cost, where ColBERT
aims to simplify the neural architecture and Bi-ColBERT focuses on effective embedding
binarization.
TABLE 5.8
Quantization results of Bi-ColBERT.
Model MRR@10
BERTbase 16.7
BERTlarge 19.8
ColBERT 32.8
Bi-ColBERT 31.7
In summary, this paper’s contributions can be concluded as: (1) The first work to binarize
ColBERT. (2) A semantic diffusion method to hedge the information loss against embedding
binarization. (3) An approximation of Unit Impulse Function [18] for more accurate gradient
estimation.
6
Applications in Computer Vision
6.1 Introduction
In this section, we introduce the applications of binary neural networks in the field of com-
puter vision. Specifically, we introduce the vision tasks including person re-identification, 3D
point cloud processing, object detection, and speech recognition. First, we briefly overview
these areas.
ric space distances to learn local features with increasing contextual scales, with novel set learning layers to adaptively combine multi-scale features based on uniform densities. PointCNN [134] is introduced to learn an X-transformation from the input points that simultaneously weighs the input features associated with the points and permutes them into a latent, potentially canonical order. Grid-GCN [256] takes advantage of the Coverage-Aware Grid Query (CAGQ) strategy for point-cloud processing, which leverages the efficiency of grid space. In this way, Grid-GCN improves spatial coverage while reducing theoretical time complexity.
FIGURE 6.1
An illustration of BiRe-ID based on KR-GAL and FR-GAL, applying Kernel Refining
Generative Adversarial Learning (KR-GAL) and Feature Refining Generative Adversarial
Learning (FR-GAL). KR-GAL consists of the unbinarized kernel wi , corresponding bina-
rized kernel bwi , and the attention-aware scale factor αi . αi is employed to channel-wise
reconstruct the binarized kernel bwi . We employ conventional MSE loss and a GAN to fully
refine wi and αi . FR-GAL is a self-supervision tool to refine the features of the low-level
layers with the semantic information contained by the high-level features. To compare the
features of the low- and high-level parts, we employ a 1×1 convolution and nearest neighbor
interpolation f (·) to keep the channel dimension identical. Then the high-level features can
be utilized to refine the low-level feature through a GAN.
Q = {a1 , a2 , · · · , an } , (6.1)
where Q is a discrete set and n is the bit size of the set Q. For example, n is set to $2^{16}$ when performing 16-bit quantization. Then, we define the projection of x ∈ R onto the set Q as
$$P_{\mathbb{R}\to Q}(x) = \begin{cases} a_1, & x < \frac{a_1 + a_2}{2} \\ \;\cdots \\ a_i, & \frac{a_{i-1} + a_i}{2} \leq x < \frac{a_i + a_{i+1}}{2} \\ \;\cdots \\ a_n, & \frac{a_{n-1} + a_n}{2} \leq x. \end{cases} \qquad (6.2)$$
By projecting 32-bit weights and activations into low-bit representations, the computational cost is greatly reduced. In the extreme case, binarizing the weights and activations of neural networks decreases the storage and computation cost by 32× and 64×, respectively. Considering the binarization process of BNNs, Eqs. 6.1 and 6.2 are relaxed into
$$P_{\mathbb{R}\to B}(x) = \begin{cases} -1, & x < 0 \\ +1, & 0 \leq x, \end{cases} \quad \text{s.t. } B = \{-1, +1\}, \qquad (6.3)$$
where we set $a_1 = -1$ and $a_2 = +1$. Then $P_{\mathbb{R}\to B}(\cdot)$ is equivalent to the sign function, i.e., sign(·).
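A minimal sketch of the two projections above: Eq. 6.2 maps each real value to the nearest element of the discrete set Q (assumed sorted), and Eq. 6.3 is the special binary case, i.e., the sign function. The 4-bit uniform grid used for Q is only an assumption for illustration.

```python
import torch

def project_to_set(x: torch.Tensor, q: torch.Tensor):
    """Eq. 6.2: project each real value onto the nearest element of the
    discrete set Q = {a_1, ..., a_n} (assumed sorted), i.e., midpoint thresholding."""
    idx = torch.argmin((x.unsqueeze(-1) - q).abs(), dim=-1)
    return q[idx]

def project_to_binary(x: torch.Tensor):
    """Eq. 6.3: the special case Q = {-1, +1}, i.e., the sign function."""
    return torch.where(x < 0, -torch.ones_like(x), torch.ones_like(x))

x = torch.randn(6)
q4 = torch.linspace(-1, 1, steps=2 ** 4)       # a 4-bit uniform example of Q
print(project_to_set(x, q4))
print(project_to_binary(x))
```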
The learning objective of conventional BNNs (XNOR-Net) is defined to minimize the geometric distance between x and $P_{\mathbb{R}\to B}(x)$ as
where α is an auxiliary scale factor. In recent works of binarized neural networks (BNNs)
[199, 159], they explicitly solve the objective as
$$\alpha = \frac{\|x\|_1}{\mathrm{size}(x)}, \qquad (6.5)$$
where size(x) denotes the number of elements in x. However, this objective is insufficient to
maintain the information of the real-valued counterpart x. To overcome this shortcoming,
we introduce the kernel refining convolution.
Furthermore, XNOR-Net, which aligns with most BNNs, leads to intrachannel feature
homogenization, thus causing degradation of feature representation capacity. Hence, a new
feature refinement method should be introduced.
ai = ai−1 ⊗ wi , (6.6)
where ⊗ is the convolutional operation. As mentioned above, the BNN model aims to binarize $w_i$ and $a_i$ into $P_{\mathbb{R}\to B}(w_i)$ and $P_{\mathbb{R}\to B}(a_i)$. For simplicity, in this chapter we denote $P_{\mathbb{R}\to B}(w_i)$ and $P_{\mathbb{R}\to B}(a_i)$ as $b_{w_i} \in B^{m_i}$ and $b_{a_i} \in B^{n_i}$, respectively.
Then, we use efficient XNOR and Bit-count operations to replace real-valued operations.
Following [199], the forward process of the BNN is
$$a_i = b_{a_{i-1}} \circledast b_{w_i}, \qquad (6.7)$$
where $\circledast$ represents the efficient XNOR and Bit-count operations. Based on XNOR-Net, we introduce a learnable channel-wise scale factor to modulate the amplitude of the real-valued convolution. Together with the Batch Normalization (BN) and activation layers, the 1-bit convolution is formulated as
$$b_{a_i} = \mathrm{sign}\big(\Phi(\alpha_i \circ b_{a_{i-1}} \circledast b_{w_i})\big). \qquad (6.8)$$
In KR-GAL, the original output feature ai is first scaled by a channel-wise scale factor
(vector) αi ∈ RCi to modulate the amplitude of the real-valued counterparts. It then enters
Φ(·), which represents a composite function built by stacking several layers, e.g., BN layer,
non-linear activation layer, and max pool layer. The output is then binarized to obtain the
binary activations bai ∈ Bni , using the sign function. sign(·) denotes the sign function that
returns +1 if the input is greater than zeros and −1 otherwise. Then, the 1-bit activation
bai can be used for the efficient XNOR and Bit-count of (i + 1)-th layer.
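A minimal PyTorch sketch of the data flow in Eq. 6.8 is given below; it assumes Φ(·) is BN followed by PReLU, emulates the XNOR and Bit-count kernel with a standard convolution on {−1, +1} tensors, and uses illustrative module and variable names.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BiConvUnit(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.alpha = nn.Parameter(torch.ones(out_ch))   # channel-wise scale factor
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU(out_ch)

    def forward(self, x_bin):                           # x_bin is already in {-1, +1}
        w_bin = torch.sign(self.weight)                 # b^{w_i}
        y = F.conv2d(x_bin, w_bin, padding=1)           # emulates XNOR + Bit-count
        y = y * self.alpha.view(1, -1, 1, 1)            # alpha_i applied channel-wise
        y = self.act(self.bn(y))                        # Phi(.)
        return torch.sign(y)                            # binary activation for the next layer

x = torch.sign(torch.randn(2, 16, 32, 32))
print(BiConvUnit(16, 32)(x).shape)                      # torch.Size([2, 32, 32, 32])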
However, the gap in representational capability between wi and bwi can lead to a large quantization error. We aim to narrow this gap to reduce the quantization error while increasing the information gain provided by the binarized kernels. Therefore, αi is also used to reconstruct bwi toward wi. This learnable scale factor leads to a more precise estimation of the convolutional filters by minimizing a novel adversarial loss. Discriminators D(·) with weights WD are introduced to distinguish the unbinarized kernels wi from the reconstructed ones αi ◦ bwi. Therefore, αi and WD are learned by solving the following optimization problem:
arg min_{wi, bwi, αi} max_{WD} L^K_Adv(wi, bwi, αi, WD) + L^K_MSE(wi, bwi, αi), ∀ i ∈ N, (6.9)
where L^K_Adv(wi, bwi, αi, WD) is the adversarial loss
L^K_Adv(wi, bwi, αi, WD) = log(D(wi; WD)) + log(1 − D(bwi ◦ αi; WD)), (6.10)
where D(·) consists of several basic blocks, each with a fully connected layer and a
LeakyReLU layer. In addition, we employ discriminators to refine every binarized con-
volution layer during the binarization training process.
Furthermore, L^K_MSE(wi, bwi, αi) is the kernel loss between the learned real-valued filters wi and the binarized filters bwi, which is expressed by MSE as
L^K_MSE(wi, bwi, αi) = (λ/2) ||wi − αi ◦ bwi||_2^2, (6.11)
where the MSE term balances the gap between the real-valued wi and the binarized bwi, and λ is a balancing hyperparameter.
where f(·) is the nearest-neighbor interpolation. Therefore, we formulate the learning objective for feature refinement as
arg min max_{WD} L^F_Adv(aL, a*H, WD) + L^F_MSE(aL, a*H), (6.13)
where L^F_Adv(aL, a*H, WD) is the adversarial loss
L^F_Adv(aL, a*H, WD) = log(D(a*H; WD)) + log(1 − D(aL; WD)), (6.14)
where D(·) consists of several basic blocks, each with a fully connected layer and a LeakyReLU layer. In addition, we adopt several discriminators to refine the features during the binarization training process.
Moreover, L^F_MSE(aL, a*H) is the feature loss between the low-level and high-level features, which is expressed by MSE as
L^F_MSE(aL, a*H) = (μ/2) ||aL − a*H||_2^2, (6.15)
where μ is a balancing hyperparameter.
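As a rough sketch (not the authors' implementation), the KR-GAL losses of Eqs. 6.10 and 6.11 can be prototyped as below; the discriminator width, the value of λ, and all tensor shapes are illustrative assumptions, and the feature-level losses of Eqs. 6.14 and 6.15 follow the same pattern with aL and a*H in place of the kernels.

import torch
import torch.nn as nn

def kernel_losses(w, alpha, discriminator, lam=1e-3, eps=1e-8):
    w_flat = w.flatten(1)                                  # (out_ch, in_ch*k*k)
    recon = alpha.view(-1, 1) * torch.sign(w_flat)         # alpha_i applied to b^{w_i}
    d_real, d_fake = discriminator(w_flat), discriminator(recon)
    l_adv = (torch.log(d_real + eps) + torch.log(1 - d_fake + eps)).mean()   # Eq. 6.10
    l_mse = 0.5 * lam * (w_flat - recon).pow(2).sum()                        # Eq. 6.11
    return l_adv, l_mse

disc = nn.Sequential(nn.Linear(16 * 9, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())
w = torch.randn(32, 16, 3, 3, requires_grad=True)
alpha = torch.ones(32, requires_grad=True)
l_adv, l_mse = kernel_losses(w, alpha, disc)
(l_adv + l_mse).backward()    # kernel side of Eq. 6.9: minimize both terms w.r.t. w and alpha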
6.2.4 Optimization
For a specific task, the conventional problem-dependent loss LS, e.g., the cross-entropy loss, is considered; thus the learning objective is defined as
arg min_{wi, αi, pi} LS(wi, αi, pi), ∀ i ∈ N, (6.16)
where pi denotes the other parameters of the BNN, e.g., the parameters of BN and PReLU. Therefore, the general learning objective of BiRe-ID is given by Eqs. 6.9, 6.13, and 6.16. For each convolutional layer, we sequentially update wi, αi, and pi.
Updating wi: Consider δwi as the gradient of the real-valued kernels wi. Thus,
δwi = ∂L/∂wi = ∂LS/∂wi + ∂L^K_Adv/∂wi + ∂L^F_Adv/∂wi + ∂L^K_MSE/∂wi + ∂L^F_MSE/∂wi. (6.17)
During the backpropagation of the softmax loss LS(wi, αi, pi), the gradients first go to bwi and then to wi. Thus, we formulate it as
∂LS/∂wi = (∂LS/∂bwi) · (∂bwi/∂wi), (6.18)
where
∂bwi/∂wi = 2 + 2wi, if −1 ≤ wi < 0;  2 − 2wi, if 0 ≤ wi < 1;  0, otherwise, (6.19)
which is an approximation of the 2× Dirac-delta function [159].
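A short PyTorch sketch shows how the piecewise gradient of Eq. 6.19 can stand in for the derivative of sign during backpropagation; the class name is ours, and the forward pass is simply the sign function.

import torch

class ApproxSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        grad = torch.zeros_like(w)                               # 0 outside [-1, 1)
        grad = torch.where((w >= -1) & (w < 0), 2 + 2 * w, grad)
        grad = torch.where((w >= 0) & (w < 1), 2 - 2 * w, grad)
        return grad_out * grad                                   # chain rule with dL/db^{w_i}

w = torch.randn(8, requires_grad=True)
ApproxSign.apply(w).sum().backward()
print(w.grad)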
Furthermore,
∂L^K_Adv/∂wi = (1/D(wi; WD)) · ∂D/∂wi, (6.20)
∂L^K_MSE/∂wi = λ(wi − αi ◦ bwi) ◦ αi, (6.21)
∂L^F_Adv/∂wi = −(1/(1 − D(ai; WD))) · (∂D/∂ai)(∂ai/∂wi) · I(i ∈ L), (6.22)
∂L^F_MSE/∂wi = μ(ai − a*H)(∂ai/∂wi) · I(i ∈ L), (6.23)
where I is an indicator function defined as
I(i ∈ L) = 1, if the i-th layer is supervised with FR-GAL;  0, else. (6.24)
As mentioned above, we employ several FR-GALs in the training process. Therefore, I(i ∈ L) denotes whether the i-th layer is supervised with FR-GAL. Note that FR-GAL is only used to supervise the low-level features; thus, no gradient is propagated to the high-level features.
In this way, we calculate every specific gradient of wi and update it as
wi ← wi − η1 δwi, (6.25)
where η1 is the learning rate for wi. The gradients with respect to the scale factor αi are computed similarly; in particular,
∂L^K_MSE/∂αi = −λ(wi − αi ◦ bwi) bwi, (6.29)
∂L^F_Adv/∂αi = −(1/(1 − D(ai; WD))) · (∂D/∂ai)(∂ai/∂αi) · I(i ∈ L), (6.30)
∂L^F_MSE/∂αi = μ(ai − a*H)(∂ai/∂αi) · I(i ∈ L). (6.31)
Updating pi: Finally, we update the other parameters pi with wi and αi fixed. δpi is defined as the gradient of pi:
δpi = ∂LS/∂pi, (6.32)
pi ← pi − η3 δpi, (6.33)
where η3 is the learning rate for the other parameters. These derivations demonstrate that the refining process can be trained end-to-end. The training process of our BiRe-ID is summarized in Algorithm 13. We independently update each group of parameters while keeping the other parameters of the convolutional layers fixed, to enhance the variation of the feature maps in every layer. In this way, we can accelerate the convergence of training and fully explore the potential of our 1-bit networks.
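The alternating scheme can be summarized by the following schematic PyTorch sketch; the toy loss stands in for Eqs. 6.9, 6.13, and 6.16, and the parameter grouping and learning rates are arbitrary illustrative choices.

import torch

def alternating_step(loss_fn, w_params, alpha_params, p_params, etas=(1e-3, 1e-4, 1e-3)):
    # Update each parameter group in turn while the others stay fixed.
    for group, eta in zip((w_params, alpha_params, p_params), etas):
        loss = loss_fn()                                   # recompute with the latest parameters
        grads = torch.autograd.grad(loss, group, allow_unused=True)
        with torch.no_grad():
            for param, g in zip(group, grads):
                if g is not None:
                    param -= eta * g                       # e.g. w_i <- w_i - eta_1 * delta_{w_i}

w = torch.randn(4, requires_grad=True)
a = torch.ones(4, requires_grad=True)
p = torch.zeros(4, requires_grad=True)
alternating_step(lambda: ((w * a + p) ** 2).sum(), [w], [a], [p])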
FIGURE 6.2
The variation of BiRe-ID's final mAP on Market-1501: an ablation study on λ and μ. The ResNet-18 backbone is employed.
baseline network, as shown in the second section of Table 6.1. By adding both KR-GAL and FR-GAL, our BiRe-ID achieves 10.0% higher mAP and 9.8% higher Rank@1 accuracy than the baseline, even approaching the accuracy of the corresponding real-valued network.
TABLE 6.1
The effects of different components in BiRe-ID on the Rank@1 and mAP on the
Market-1501 dataset.
ResNet-18 Rank@1 (%) mAP (%)
XNOR-Net 63.8 40.1
Proposed baseline network 74.9 54.0
Proposed baseline network + KR-GAL 80.0 61.1
Proposed baseline network + FR-GAL 78.5 58.1
Proposed baseline network + KR-GAL + FR-GAL (BiRe-ID) 84.1 64.0
Real-valued Counterpart 85.1 64.3
FIGURE 6.3
Subfigures (a) and (b) illustrate the robustness of the Gaussian distribution and the bimodal distribution, respectively. From left to right in each subfigure, we plot the distribution of the unbinarized weights wi and the binarized weights bwi. The drawback of XNOR-Net lies in subfigure (a): if a disturbance γ acts on the unbinarized weights through the discrete activation, it causes a significant disturbance on the binarized weights. Subfigure (b) shows the robustness of the bimodal distribution when influenced by the same disturbance.
Q = {a1, a2, · · · , an}, (6.34)
where Q is a discrete set and n is the bit size of the set Q. For example, n is set as 2^16 when performing 16-bit quantization.
Then, we define the projection of x ∈ R onto the set Q as
PR→Q(x) = a1, if x < (a1 + a2)/2;  ai, if (ai−1 + ai)/2 ≤ x < (ai + ai+1)/2;  an, if (an−1 + an)/2 ≤ x. (6.35)
By projecting 32-bit weights and activations into low-bit cases, the computational cost can be reduced substantially. In the extreme case, binarizing the weights and activations of neural networks decreases the storage and computation cost by 32× and 64×, respectively. Considering the binarization process of BNNs, Eqs. 6.34 and 6.35 are relaxed into
PR→B(x) = −1, if x < 0;  +1, if 0 ≤ x,   s.t. B = {−1, +1}, (6.36)
FIGURE 6.4
Outline of the 1-bit PointNet obtained by our POEM on the classification task. We keep the first and last fully connected layers real-valued, shown with horizontal stripes. We give the detailed forward and backward propagation process of POEM, where EM denotes the Expectation-Maximization algorithm and STE denotes the Straight-Through Estimator.
where we set a1 = −1 and a2 = +1. Then PR→B(·) is equivalent to the sign function, i.e., sign(·).
However, the binarization procedure achieved by PR→B(x) is sensitive to disturbance when x follows a Gaussian distribution, as in XNOR-Net. That is, the binarization results are susceptible to the noise of the raw point cloud data, as shown in Fig. 6.3. To address this issue, we first define an objective as
where α is an auxiliary scale factor. Recent works on binarized neural networks (BNNs) [199, 159] explicitly solve the objective as
α = ||x||_1 / size(x), (6.39)
where size(x) denotes the number of elements in x. However, this objective neglects the fact that α also influences the output of the 1-bit layer. In contrast, we take this shortcoming into account and modify the learning objective for our POEM.
Given a conventional FC layer, we denote wi ∈ R^mi and ai ∈ R^Ci as its weights and features in the i-th layer, where mi = Ci × Ci−1 and Ci represents the number of output channels of the i-th layer. Then we have
ai = ai−1 ⊗ wi, (6.40)
where ⊗ denotes full-precision multiplication. As mentioned above, the BNN model aims to binarize wi and ai into PR→B(wi) and PR→B(ai). For simplicity, in this chapter we denote PR→B(wi) and PR→B(ai) as bwi ∈ B^mi and bai ∈ B^Ci, respectively. Then, we use the efficient XNOR and Bit-count operations to replace the full-precision operations. Following [199], the forward process of the BNN is
ai = bai−1 ⊙ bwi, (6.41)
where ⊙ represents the efficient XNOR and Bit-count operations. Based on XNOR-Net [199], we introduce a learnable channel-wise scale factor to modulate the amplitude of the real-valued convolution. Aligned with the Batch Normalization (BN) and activation layers, the process is formulated as
bai = sign(Φ(αi ◦ bai−1 ⊙ bwi)), (6.42)
where we divide the data flow in POEM into units for a detailed discussion. In POEM, the original output feature ai is first scaled by a channel-wise scale factor (vector) αi ∈ R^Ci to modulate the amplitude of its full-precision counterpart. It then enters Φ(·), which represents a composite function built by stacking several layers, e.g., the BN layer, the non-linear activation layer, and the max-pooling layer. The output is then binarized through the sign function to obtain the binary activations bai ∈ B^Ci. sign(·) denotes the sign function that returns +1 if the input is greater than zero and −1 otherwise. The 1-bit activation bai can then be used for the efficient XNOR and Bit-count operations of the (i+1)-th layer.
where pi denotes the other parameters of the real-valued layers in the network, e.g., the BN layers, activation layers, and unbinarized fully connected layers. N denotes the number of layers in the network, LS is the cross-entropy loss, and λ is a hyperparameter. Unlike binarization methods such as XNOR-Net [199] and Bi-Real Net [159], where only the reconstruction loss is considered in the weight calculation, our proposed POEM, by fine-tuning the value of λ, can achieve much better performance than XNOR-Net, which shows the effectiveness of the combined loss over the softmax loss alone. Our discrete optimization method comprehensively optimizes the Bi-FC layers by considering the reconstruction loss and the softmax loss in a unified framework.
FIGURE 6.5
Illustration of training wij via Expectation-Maximization. We impose no constraint on the weights that already lie below the minimum mean value or above the maximum mean value. For the weights in the middle area (distribution not transparent), we apply EM(·) to constrain them to converge to a specific distribution.
δwi = ∂LS/∂wi + λ ∂LR/∂wi, (6.45)
wi ← wi − η δwi, (6.46)
where LS and LR are loss functions and η is the learning rate. ∂LS/∂wi can be computed by backpropagation, and furthermore we have
∂LR/∂wi = (wi − αi ◦ bwi) ◦ αi. (6.47)
However, this backpropagation process, without the necessary constraint, will result in a Gaussian distribution of wi, which degrades the robustness of Bi-FCs as revealed in Fig. 6.3. Our POEM therefore takes another learning objective, based on the following assumption.
Assumption 6.3.1. For every unbinarized weight of the i-th 1-bit layer, i.e., ∀w_i^j ∈ wi, it can be constrained to follow a Gaussian Mixture Model (GMM).
P(wi|Θi) = Σ_{k=1}^{2} β_i^k p(wi|Θ_i^k), (6.49)
where the number of distributions is set as 2 in this paper. Θ_i^k = {μ_i^k, σ_i^k} denotes the parameters of the k-th distribution, i.e., μ_i^k denotes the mean value and σ_i^k the variance, respectively.
To solve the GMM with the observed data wi, i.e., the weight ensemble in the i-th layer, we introduce the hidden variable ξ_i^{jk} to formulate the maximum likelihood estimation (MLE) of the GMM as
ξ_i^{jk} = 1, if w_i^j ∈ p_i^k;  0, else, (6.50)
where ξ_i^{jk} is the hidden variable that describes the affiliation of w_i^j with p_i^k (a simplified notation of p(wi|Θ_i^k)). We then define the likelihood function P(w_i^j, ξ_i^{jk}|Θ_i^k) as
P(w_i^j, ξ_i^{jk}|Θ_i^k) = Π_{k=1}^{2} (β_i^k)^{|p_i^k|} Π_{j=1}^{m_i} [(1/Ω) f(w_i^j, μ_i^k, σ_i^k)]^{ξ_i^{jk}}, (6.51)
where Ω = 2π|σ_i^k|, |p_i^k| = Σ_{j=1}^{m_i} ξ_i^{jk}, and m_i = Σ_{k=1}^{2} |p_i^k|. And f(w_i^j, μ_i^k, σ_i^k) is defined as
f(w_i^j, μ_i^k, σ_i^k) = exp(−(w_i^j − μ_i^k)^2 / (2σ_i^k)). (6.52)
Hence, for every single weight w_i^j, ξ_i^{jk} can be computed by maximizing the likelihood as
max_{ξ_i^{jk}, ∀j,k} E[log P(w_i^j, ξ_i^{jk}|Θ_i^k) | w_i^j, Θ_i^k], (6.53)
where E(·) denotes the expectation. Therefore, the maximum likelihood estimate ξ̂_i^{jk} is calculated as
After the expectation step, we perform the maximization step to compute Θ_i^k as
μ̂_i^k = Σ_{j=1}^{m_i} ξ̂_i^{jk} w_i^j / Σ_{j=1}^{m_i} ξ̂_i^{jk}, (6.55)
σ̂_i^k = Σ_{j=1}^{m_i} ξ̂_i^{jk} (w_i^j − μ̂_i^k)^2 / Σ_{j=1}^{m_i} ξ̂_i^{jk}, (6.56)
β̂_i^k = Σ_{j=1}^{m_i} ξ̂_i^{jk} / m_i. (6.57)
where η is the learning rate. The gradient derived from the softmax loss can be easily calculated via backpropagation. Based on Eq. 6.44, we have
∂LR/∂αi = (wi − αi ◦ bwi) · bwi. (6.62)
FIGURE 6.6
Detailed architectures of the 1-bit networks implemented by us: (a) the 1-bit PointNet, where MM is short for matrix multiplication; (b) the 1-bit PointNet++, where Cat denotes the concatenation operation; (c) the 1-bit DGCNN; (d) the FC unit and the Bi-FC unit used in (a) to (c). We use two BNs in the Bi-FC unit.
Updating pi: We finally update the other parameters pi with wi and αi fixed. δpi is defined as the gradient of pi, formulated as
δpi = ∂LS/∂pi, (6.63)
pi ← pi − η δpi. (6.64)
The above derivations show that POEM is learnable with the backpropagation (BP) algorithm. Our POEM is supervised by a simple and effective reconstruction loss function. Moreover, we introduce an efficient Expectation-Maximization algorithm to optimize the unbinarized weights, thus constraining them to form a bimodal distribution.
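A NumPy sketch of the two-component E-M fit that POEM applies to each layer's unbinarized weights (Eqs. 6.49–6.57) is given below; the soft responsibilities replace the hard assignment of Eq. 6.50, and the initialization and iteration count are arbitrary choices.

import numpy as np

def em_two_component(w, iters=10):
    w = w.ravel()
    mu = np.array([w.min(), w.max()])                 # two means
    var = np.array([w.var(), w.var()]) + 1e-6         # two variances
    beta = np.array([0.5, 0.5])                       # mixing coefficients
    for _ in range(iters):
        # E-step: responsibilities (a soft version of Eq. 6.50)
        dens = beta * np.exp(-(w[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        xi = dens / dens.sum(axis=1, keepdims=True)
        # M-step: Eqs. 6.55-6.57
        nk = xi.sum(axis=0)
        mu = (xi * w[:, None]).sum(axis=0) / nk
        var = (xi * (w[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        beta = nk / w.size
    return mu, var, beta

print(em_two_component(np.concatenate([np.random.randn(500) - 1, np.random.randn(500) + 1])))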
decrease τ. We obtain the optimal 1-bit PointNet with POEM when {λ, τ} is set as {1×10−4, 1×10−3}. Hence, we extend this hyperparameter setting to the other experiments in this chapter.
We also set τ as 1×10−3 and plot the training-accuracy curves of POEM with different λ, together with XNOR-Net. Figure 6.7 shows that the 1-bit PointNet obtained by POEM achieves the best training accuracy when λ is set as 1×10−4. Also, with EM-optimized backpropagation, the weight convergence becomes better than that of XNOR-Net (in purple), as shown in Fig. 6.7.
Evaluating the components of POEM: In this part, we evaluate every critical component of POEM to show how we compose the novel and effective POEM. We first introduce our baseline network by adding a single BN layer ahead of the 1-bit convolutions of XNOR-Net, which brings an improvement of 2.8% in OA. As shown in Table 6.3, the introduction of PReLU, EM, and the learnable scale factor improves the accuracy by 1.9%, 3.1%, and 3.4%, respectively, over the baseline network, as shown in the second section of Table 6.3. By adding PReLU, EM, and the learnable scale factor together, our POEM achieves 7.1% higher accuracy than the baseline, even surpassing the accuracy of the corresponding real-valued network.
Compared to merely using PReLU, the use of our main contributions, EM and the learnable scale factor, increases the accuracy by 5.2%, which is very significant for the point cloud classification task. The 1-bit PointNet achieves a performance that even approaches the real-valued PointNet++ baseline within 2.0% (90.2% vs. 91.9%).
FIGURE 6.7
Training accuracies of POEM (τ = 1 × 10−3 ) with different λ and XNOR-Net.
FIGURE 6.8
(a) and (b) illustrate the distribution of the unbinarized weights wi of the 6-th 1-bit layer in the 1-bit PointNet backbone when trained under XNOR-Net and our POEM, respectively. From left to right, we report the weight distributions at initialization and at the 40-th, 80-th, 120-th, 160-th, and 200-th epochs. Our POEM obtains an apparent bimodal distribution, which is much more robust.
TABLE 6.3
The effects of different components of POEM on OA.
1-bit PointNet OA (%)
XNOR-Net 81.9
Proposed baseline network 83.1
Proposed baseline network + PReLU 85.0
Proposed baseline network + EM 86.2
Proposed baseline network + LSF 86.5
Proposed baseline network + PReLU + EM + LSF (POEM) 90.2
Real-valued Counterpart 89.2
Note: PReLU, EM, and LSF denote the components introduced into our proposed baseline network; LSF is short for the learnable scale factor. The proposed baseline network + PReLU + EM + LSF is the POEM we propose.
FIGURE 6.9
Example layer-wise feature map distribution and detection results of (a) a real-valued detec-
tor, (b) LWS-Det, and (c) BiDet. We extract the feature maps of the first, second, and final
binarized layers and illustrate their distributions based on the frequency-value histogram in
rows 1–3. The last row shows the detection result.
Figure 6.9 shows the layer-wise feature map distributions and detection results of a real-valued detector, our LWS-Det, and BiDet [240], from left to right. The first three rows show the distributions of the feature maps. The variance of BiDet's feature map distribution deviates more from that of the real-valued detector, leading to false positives and missed detections in the fourth row. In comparison, our LWS-Det reduces the binarization error and provides better detection results.
In this section, we present a layer-wise search method to produce an optimized 1-bit detector (LWS-Det) [264], using the student-teacher framework to narrow the performance gap. As shown in Fig. 6.10, we minimize the binarization error by decoupling it into angular and amplitude errors. We search for the binarized weights under a differentiable binarization search (DBS) framework, following the DARTS method [151, 305], supervised by well-designed losses between the real-valued convolution and the 1-bit convolution. We formulate binarization as choosing a combination of −1 and +1, so a differentiable search can explore the binary space and significantly improve the capacity of 1-bit detectors. To improve the representation ability of LWS-Det, we design two losses to supervise the 1-bit convolution layer from the angular and amplitude perspectives. In this way, we obtain a powerful 1-bit detector (LWS-Det) that minimizes angular and amplitude errors in the same framework.
6.4.1 Preliminaries
Given a conventional CNN model, we denote wi ∈ Rni and ai ∈ Rmi as its weights and
feature maps in the i-th layer, where ni = Ci · Ci−1 · Ki · Ki and mi = Ci · Wi · Hi . Ci
represents the number of output channels of the i-th layer. (Wi , Hi ) are the width and
height of the feature maps and Ki is the kernel size. Then we have the following.
ai = ai−1 ⊗ wi , (6.65)
FIGURE 6.10
Our LWS-Det. From left to right are the input, search, and learning processes. For a given 1-bit convolution layer, LWS-Det first searches for the binary weights (+1 or −1) by minimizing the angular loss supervised by a real-valued teacher detector. LWS-Det then learns the real-valued scale factor α to enhance the feature representation ability.
where ⊗ is the convolution operation. We omit the batch normalization (BN) and activation layers for simplicity. The 1-bit model aims to quantize wi and ai into ŵi ∈ {−1, +1}^ni and âi ∈ {−1, +1}^mi, using efficient XNOR and Bit-count operations to replace the full-precision operations. Following [99], the forward process of the 1-bit CNN is
âi = sign(âi−1 ⊙ ŵi), (6.66)
where ⊙ represents the XNOR and Bit-count operations and sign(·) denotes the sign function, which returns +1 if the input is greater than zero and −1 otherwise. This binarization process introduces a binarization error, as illustrated in Figs. 6.11 (a) and (b): the product of the 1-bit convolution (b) cannot simulate that of the real-valued one (a) in either angle or amplitude.
Substantial efforts have been made to reduce this error. [199, 228] formulate the objective as
L^w_i = ||wi − αi ◦ ŵi||_2^2, (6.67)
where ◦ denotes channel-wise multiplication and αi is the vector consisting of channel-wise scale factors. As in Fig. 6.11 (c), [199, 228] learn αi by directly optimizing L^w_i toward 0, and thus the explicit solution is
α_i^j = ||w_i^j||_1 / (Ci−1 · K_i^j · K_i^j), (6.68)
where j denotes the j-th channel of the i-th layer. Other works [77] dynamically evaluate Eq. 6.67 rather than solving it explicitly, or modify αi to other shapes [26].
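For reference, the closed-form channel-wise scale factor of Eq. 6.68 amounts to the mean absolute value of each output-channel kernel; the following NumPy sketch uses illustrative shapes.

import numpy as np

def channelwise_alpha(w):
    # w has shape (C_out, C_in, K, K); returns one alpha per output channel (Eq. 6.68).
    c_out = w.shape[0]
    return np.abs(w.reshape(c_out, -1)).sum(axis=1) / w[0].size

w = np.random.randn(64, 32, 3, 3)
alpha = channelwise_alpha(w)                              # shape (64,)
w_hat = alpha[:, None, None, None] * np.sign(w)
print(alpha.shape, np.mean((w - w_hat) ** 2))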
Previous work mainly focuses on kernel reconstruction but neglects the angular information, as shown in Fig. 6.11 (d). One drawback of existing methods lies in their ineffectiveness when binarizing very small floating-point values, as shown in Fig. 6.11. On the contrary, we leverage the strong capacity of a differentiable search to fully explore the binary space for an ideal combination of −1 and +1, without an ambiguous binarization process involved.
FIGURE 6.11
An illustration of the binarization error in the 3-dimensional space. (a) The intersection angle θ of the real-valued weight w and activation a is significant. (b) After binarization (ŵ, â) based on the sign function, the intersection angle θ̂ = 0. (c) θ̂ = 0 based on XNOR-Net binarization. (d) Ideal binarization via angular and amplitude error minimization.
illustrated in Fig. 6.10. As depicted above, the main learning objective (the layer-wise binarization error) is defined as
E = Σ_{i=1}^{N} ||ai−1 ⊗ wi − âi−1 ⊙ ŵi ◦ αi||_2^2, (6.69)
argmin_{ŵi, αi} Ei(ŵi, αi; wi, ai−1, âi−1), ∀i ∈ [1, N]. (6.70)
In LWS-Det, we learn Eq. 6.70 by decoupling it into an angular loss and an amplitude loss, where we optimize the angular loss by differentiable binarization search (DBS) and the amplitude loss by learning the scale factor.
L^Ang_i = ||cos θi − cos θ̂i||_2^2
= ||(ai−1 ⊗ wi)/(||ai−1||_2 ||wi||_2) − (âi−1 ⊙ ŵi)/(||âi−1||_2 ||ŵi||_2)||_2^2. (6.71)
For the learning process of the i-th layer, the objective is formulated as
argmin_{ŵi} L^Ang_i(ŵi; âi, wi, ai). (6.72)
2: for i = 1 to N do
3: while Differentiable search do
4: Compute L^Ang_i, L^Amp_i, L^W_i
5: end while
6: end for
7: Compute LGT , LLim
8: for i = N to 1 do
9: Update parameters via back propagation
10: end for
We introduce the DARTS framework to solve Eq. 6.72, named differentiable binarization search (DBS). We follow [151] to efficiently find ŵi. Specifically, we approximate ŵi by the weighted probability of two matrices whose weights are all set as −1 and +1, respectively. We relax the choice of a particular weight by the probability function defined as
p_i^{o_k} = exp(β_i^{o_k}) / Σ_{o_k ∈ O} exp(β_i^{o_k}),   s.t. O = {ŵi^−, ŵi^+}, (6.73)
where p_i^{o_k} is the probability matrix belonging to the operation o_k ∈ O. The search space O is defined as the two possible weights {ŵi^−, ŵi^+}. For the inference stage, we select the weight with the maximum probability as
w̃_{i,l} = arg max_{o_k} p_{i,l}^{o_k}, (6.74)
where p_{i,l}^{o_k} denotes the probability that the l-th weight of the i-th layer belongs to operation o_k. Therefore, the l-th weight of w̃, that is, w̃_{i,l}, is defined by the operation having the highest probability. In this way, we modify Eq. 6.71 by substituting ŵi with w̃i as
L^Ang_i = ||(ai−1 ⊗ wi)/(||ai−1||_2 ||wi||_2) − (âi−1 ⊙ w̃i)/(||âi−1||_2 ||w̃i||_2)||_2^2. (6.75)
In this way, we retain the top-1 strongest operation (from the two distinct weights) for each weight of ŵi in the discrete set {+1, −1}.
where LGT is the detection loss derived from the ground truth label and LLim is the fine-
grained feature limitation defined in [235]. The LWS-Det process is outlined in Algorithm
13.
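A toy PyTorch sketch of the DBS relaxation in Eqs. 6.73–6.75 for one flattened layer follows: two logits per weight are softmaxed during search, and an argmax fixes the binary value at inference. The dot-product angular term is only a scalar stand-in for the convolution in Eq. 6.75, and all shapes are illustrative.

import torch
import torch.nn.functional as F

beta = torch.randn(2, 64, requires_grad=True)             # logits for {-1, +1}, 64 weights
candidates = torch.tensor([-1.0, 1.0]).view(2, 1)

def relaxed_weight(beta):
    p = F.softmax(beta, dim=0)                            # Eq. 6.73
    return (p * candidates).sum(dim=0)                    # soft mixture used during search

def discrete_weight(beta):
    return candidates[beta.argmax(dim=0)].squeeze(-1)     # Eq. 6.74

w_real, a_real = torch.randn(64), torch.randn(64)
w_soft = relaxed_weight(beta)
cos_real = torch.dot(a_real, w_real) / (a_real.norm() * w_real.norm())
cos_bin = torch.dot(torch.sign(a_real), w_soft) / (a_real.numel() ** 0.5 * w_soft.norm())
angular_loss = (cos_real - cos_bin) ** 2                  # scalar stand-in for Eq. 6.75
angular_loss.backward()                                   # gradients flow into the logits beta
print(discrete_weight(beta)[:8])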
TABLE 6.4
Ablation study: comparison of the performance of different binarization methods with DBS.
Framework Backbone Binarization Method mAP
Faster-RCNN ResNet-18 Sign 65.1
Faster-RCNN ResNet-18 RSign 68.9
Faster-RCNN ResNet-18 Random Search 64.1
Faster-RCNN ResNet-18 DBS 73.2
Faster-RCNN ResNet-18 Real-valued 76.4
SSD VGG-16 Sign 65.9
SSD VGG-16 RSign 68.1
SSD VGG-16 Random Search 60.1
SSD VGG-16 DBS 71.4
SSD VGG-16 Real-valued 74.3
FIGURE 6.12
Convergence of Faster-RCNN with the ResNet-18 backbone (left) and SSD with the VGG-16 backbone (right) under different binarization methods, trained on VOC trainval2007 and trainval2012.
FIGURE 6.13
The input images and the saliency maps follow [79]. The images are randomly selected from
VOC test2007. Each row includes: (a) input images, saliency maps of (b) Faster-RCNN
with ResNet-101 backbone (Res101), (c) Faster-RCNN with ResNet-18 backbone (Res18),
(d) 1-bit Faster-RCNN with ResNet-18 backbone (BiRes18), respectively.
We observe that knowledge distillation (KD) methods such as [235] are effective for distilling real-valued Faster-RCNNs only when the teacher model and its student counterpart share a small information discrepancy on proposals, as shown in Fig. 6.13 (b) and (c). This phenomenon does not occur for the 1-bit Faster-RCNN, as shown in Fig. 6.13 (b) and (d), which might explain why existing KD methods are less effective for 1-bit detectors. A statistic on the COCO and PASCAL VOC datasets in Fig. 6.14 shows that the discrepancy between the
FIGURE 6.14
The Mahalanobis distance of the gradient in the intermediate neck feature between Res101-Res18 (gathering on the left) and Res101-BiRes18 (uniformly dispersed) on various datasets: (a) VOC trainval0712, (b) VOC test2007, (c) COCO trainval35k, and (d) COCO minival.
proposal saliency maps of Res101 and Res18 (blue) is much smaller than that of Res101 and BiRes18 (orange); that is, the smaller the distance, the smaller the discrepancy. In brief, conventional KD methods are effective for distilling real-valued detectors but appear less effective for distilling 1-bit detectors.
We are motivated by the observation above and present an information discrepancy-
aware distillation for 1-bit detectors (IDa-Det) [260]. This can effectively address the infor-
mation discrepancy problem, leading to an efficient distillation process. As shown in Fig.
6.15, we introduce a discrepancy-aware method to select proposal pairs and facilitate dis-
tilling 1-bit detectors, rather than only using object anchor locations of student models or
ground truth as in existing methods [235, 264, 79]. We further introduce a novel entropy dis-
tillation loss to leverage more comprehensive information than conventional loss functions.
By doing so, we achieve a powerful information discrepancy-aware distillation method for
1-bit detectors (IDa-Det).
FIGURE 6.15
Overview of the proposed information discrepancy-aware distillation (IDa-Det) framework.
We first select representative proposal pairs based on the information discrepancy. Then we
propose the entropy distillation loss to eliminate the information discrepancy.
6.5.1 Preliminaries
In a specific convolution layer, w ∈ R^{Cout×Cin×K×K}, ain ∈ R^{Cin×Win×Hin}, and aout ∈ R^{Cout×Wout×Hout} represent its weights and feature maps, where Cin and Cout represent the numbers of input and output channels, (H, W) are the height and width of the feature maps, and K denotes the kernel size. Then we have
aout = ain ⊗ w, (6.78)
where ⊗ is the convolution operation. We omit the batch normalization (BN) and activation layers for simplicity. The 1-bit model aims to quantize w and ain into bw ∈ {−1, +1}^{Cout×Cin×K×K} and bain ∈ {−1, +1}^{Cin×H×W}, using efficient XNOR and Bit-count operations to replace the full-precision operations. Following [48], the forward process of the 1-bit CNN is
aout = α ◦ (bain ⊙ bw), (6.79)
where ⊙ denotes the XNOR and Bit-count operations and ◦ denotes channel-wise multiplication. α = [α1, · · · , αCout] ∈ R+ is the vector consisting of channel-wise scale factors, and b = sign(·) denotes the binarized variable obtained with the sign function, which returns +1 if the input is greater than zero and −1 otherwise. The output then enters several non-linear layers, e.g., a BN layer, a non-linear activation layer, and a max-pooling layer, which we omit for simplicity. Then, the output aout is binarized into baout via the sign function. The fundamental objective of BNNs is to keep w as close as possible before and after binarization so as to minimize the binarization effect. We therefore define the reconstruction error as
LR(w, α) = ||w − α ◦ bw||. (6.80)
FIGURE 6.16
Illustration of the generation of the proposal pairs. Every single proposal in one model generates a counterpart feature map patch at the same location in the other model.
εn = Σ_{c=1}^{C} ||(R^t_{n;c} − R^s_{n;c})^T Σ^{−1}_{n;c} (R^t_{n;c} − R^s_{n;c})||_2, (6.83)
where Σ_{n;c} denotes the covariance matrix of the teacher and the student in the c-th channel of the n-th proposal pair. The Mahalanobis distance takes into account both the pixel-level distance between the proposals and the differences in the statistical characteristics of each pair of proposals.
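A NumPy sketch of the per-channel information discrepancy of Eq. 6.83 for one proposal pair is shown below; for simplicity the channel covariance is replaced by a single scalar variance estimate, which is an assumption rather than the exact formulation.

import numpy as np

def information_discrepancy(r_t, r_s, eps=1e-6):
    # r_t, r_s: proposal features of shape (C, H, W) from the teacher and the student.
    eps_n = 0.0
    for ch in range(r_t.shape[0]):
        d = (r_t[ch] - r_s[ch]).ravel()
        var = np.var(np.concatenate([r_t[ch].ravel(), r_s[ch].ravel()])) + eps
        eps_n += np.sum(d * d) / var        # d^T Sigma^{-1} d with a shared scalar variance
    return eps_n

r_t = np.random.randn(8, 7, 7)
r_s = r_t + 0.1 * np.random.randn(8, 7, 7)
print(information_discrepancy(r_t, r_s))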
To select representative proposals with maximum information discrepancy, we first define a binary distillation mask mn as
mn = 1, if pair (R^t_n, R^s_n) is selected;  0, otherwise, (6.84)
where mn = 1 denotes that distillation will be applied to this proposal pair; otherwise, the pair remains unchanged. For each pair of proposals, only when their distributions are sufficiently different does the student model need to learn from the teacher counterpart, i.e., a distillation process is needed.
On the basis of the derivation above, the discrepant proposal pairs will be optimized through distillation. To distill the selected pairs, we resort to maximizing the conditional probability p(R^s_n|R^t_n). That is, after distillation or optimization, the feature distributions of the teacher proposals and the student counterparts become similar. To this end, we define p(R^s_n|R^t_n) with mn, n ∈ {1, · · · , NT + NS} in consideration as
p(R^s_n|R^t_n; mn) ∼ mn N(μ^t_n, σ^{t2}_n) + (1 − mn) N(μ^s_n, σ^{s2}_n). (6.85)
Here is the upper level of the bi-level optimization, where m is already solved and therefore omitted. We rewrite Eq. 6.87 and further obtain our entropy distillation loss as
LP(w, α; γ) = ||(R^s_n − R^t_n)^T Cov(R^s_n, R^t_n)^{−1}(R^s_n − R^t_n)||_2 + log(Cov(R^s_n, R^t_n)), (6.88)
where Cov(R^s_n, R^t_n) = E(R^s_n R^t_n) − E(R^s_n)E(R^t_n) denotes the covariance matrix.
Hence, we train the 1-bit student model end-to-end; the total loss for distilling the student model is defined as
L = LGT(w, α) + λ LP(w, α; γ) + μ LR(w, α), (6.89)
where LGT is the detection loss derived from the ground-truth label and LR is defined in Eq. 6.80.
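A sketch of the entropy distillation loss of Eq. 6.88 and the total loss of Eq. 6.89 for a single selected pair follows; the scalar covariance estimate, the weights lam and mu, and the stand-in values for LGT and LR are illustrative assumptions.

import torch

def entropy_distillation(r_s, r_t, eps=1e-6):
    d = (r_s - r_t).flatten()
    cov = torch.mean(r_s.flatten() * r_t.flatten()) - r_s.mean() * r_t.mean()  # Cov(R_s, R_t)
    cov = cov.abs() + eps
    return (d @ (d / cov)).sqrt() + torch.log(cov)   # quadratic term plus log-covariance term

r_t = torch.randn(8, 7, 7)
r_s = (r_t + 0.1 * torch.randn(8, 7, 7)).requires_grad_()
l_gt, l_r = torch.tensor(1.0), torch.tensor(0.2)     # stand-ins for L_GT and L_R
lam, mu = 0.4, 1e-4
total = l_gt + lam * entropy_distillation(r_s, r_t) + mu * l_r   # Eq. 6.89
total.backward()
print(total.item())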
FIGURE 6.17
On VOC, we (a) select μ on the raw detector and different KD methods including Hint [33],
FGFI [235], and IDa-Det; (b) select λ and γ on IDa-Det with μ set as 1e−4.
by 2.5%, 2.4%, and 1.8% compared to non-distillation, Hint, and FGFI, respectively, under the same student-teacher framework. We then evaluate the proposed entropy distillation loss against the conventional ℓ2 loss, the inner-product loss, and the cosine-similarity loss. As shown in Table 6.5, our entropy distillation loss improves the distillation performance by 0.4%, 0.3%, and 0.4% with the Hint, FGFI, and IDa methods, respectively, compared with the ℓ2 loss. Compared to the inner-product loss and the cosine-similarity loss, the entropy loss outperforms them by 2.1% and 0.5% in mAP in our framework, which further reflects the effectiveness of our method.
TABLE 6.5
The effects of different components in IDa-Det with the Faster-RCNN model on the PASCAL VOC dataset.
Model Proposal selection Distillation method mAP
Res18 78.6
BiRes18 74.0
Res101-BiRes18 Hint ℓ2 74.1
Res101-BiRes18 Hint Entropy loss 74.5
Res101-BiRes18 FGFI ℓ2 74.7
Res101-BiRes18 FGFI Entropy loss 75.0
Res101-BiRes18 IDa Inner-product 74.8
Res101-BiRes18 IDa Cosine similarity 76.4
Res101-BiRes18 IDa ℓ2 76.5
Res101-BiRes18 IDa Entropy loss 76.9
Note: Hint [33] and FGFI [235] are used for comparison with our information discrepancy-aware proposal selection (IDa). IDa and the entropy loss are the main components of the proposed IDa-Det.
Bibliography
[1] Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Chenlei Guo.
Knowledge distillation from internal representations. In Proceedings of the AAAI
Conference on Artificial Intelligence, pages 7350–7357, 2020.
[2] Milad Alizadeh, Javier Fernández-Marqués, Nicholas D Lane, and Yarin Gal. An
empirical study of binary neural networks’ optimisation. In Proceedings of the Inter-
national Conference on Learning Representations, 2018.
[3] Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge dis-
tillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020.
[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adver-
sarial networks. In Proceedings of the International Conference on Machine Learning,
pages 214–223, 2017.
[5] Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Irwin King, and Michael R Lyu. Towards
efficient post-training quantization of pre-trained language models. arXiv preprint
arXiv:2109.15082, 2021.
[6] Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael
Lyu, and Irwin King. Binarybert: Pushing the limit of bert quantization. arXiv
preprint arXiv:2012.15701, 2020.
[7] Slawomir Bak, Peter Carr, and Jean-Francois Lalonde. Domain adaptation through
synthesis for unsupervised person re-identification. In Proceedings of the European
Conference on Computer Vision, pages 189–205, 2018.
[8] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit
training of neural networks. Advances in neural information processing systems, 31,
2018.
[9] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating
gradients through stochastic neurons for conditional computation. arXiv preprint
arXiv:1308.3432, 2013.
[10] Joseph Bethge, Christian Bartz, Haojin Yang, Ying Chen, and Christoph Meinel.
Meliusnet: Can binary neural networks achieve mobilenet-level accuracy? arXiv
preprint arXiv:2001.05936, 2020.
[11] Joseph Bethge, Marvin Bornstein, Adrian Loy, Haojin Yang, and Christoph
Meinel. Training competitive binary neural networks from scratch. arXiv preprint
arXiv:1812.01965, 2018.
[12] Joseph Bethge, Haojin Yang, Marvin Bornstein, and Christoph Meinel. Binary-
densenet: developing an architecture for binary neural networks. In Proceedings of
the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0,
2019.
[13] Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak.
Lsq+: Improving low-bit quantization through learnable offsets and better initializa-
tion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops, pages 696–697, 2020.
[14] Christopher M Bishop. Bayesian neural networks. Journal of the Brazilian Computer
Society, 4(1):61–68, 1997.
[15] David M Blei, John D Lafferty, et al. A correlated topic model of science. The annals
of applied statistics, 1(1):17–35, 2007.
[16] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight
uncertainty in neural network. In Proceedings of the International Conference on
Machine Learning, pages 1613–1622, 2015.
[17] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Understanding and
overcoming the challenges of efficient transformer quantization. arXiv preprint
arXiv:2109.12948, 2021.
[18] Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier transform and its
applications, volume 31999. McGraw-Hill New York, 1986.
[19] Leo Breiman. Bias, variance, and arcing classifiers. Technical report, Tech. Rep. 460,
Statistics Department, University of California, Berkeley . . . , 1996.
[20] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot
model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344,
2017.
[21] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. Advances in neural information processing
systems, 33:1877–1901, 2020.
[22] A. Buades, B. Coll, and J. Morel. A non-local algorithm for image denoising. In
CVPR, 2005.
[23] Adrian Bulat, Jean Kossaifi, Georgios Tzimiropoulos, and Maja Pantic. Matrix
and tensor decompositions for training binary neural networks. arXiv preprint
arXiv:1904.07852, 2019.
[24] Adrian Bulat, Brais Martinez, and Georgios Tzimiropoulos. Bats: Binary architecture
search. In Proc. of ECCV, pages 309–325, 2020.
[25] Adrian Bulat and Georgios Tzimiropoulos. Binarized convolutional landmark localiz-
ers for human pose estimation and face alignment with limited resources. In Proceed-
ings of the IEEE International Conference on Computer Vision, pages 3706–3714,
2017.
[26] Adrian Bulat and Georgios Tzimiropoulos. Xnor-net++: Improved binary neural
networks. arXiv preprint arXiv:1909.13863, 2019.
[27] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architec-
ture search by network transformation. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 32, 2018.
[28] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-level net-
work transformation for efficient architecture search. In International Conference on
Machine Learning, pages 678–687. PMLR, 2018.
[29] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search
on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
[30] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object
detection. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 6154–6162, 2018.
[31] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kir-
illov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Com-
puter Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
[32] John G Carney, Pádraig Cunningham, and Umesh Bhagwan. Confidence and pre-
diction intervals for neural network ensembles. In IJCNN’99. International Joint
Conference on Neural Networks. Proceedings (Cat. No. 99CH36339), volume 2, pages
1215–1218. IEEE, 1999.
[33] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker.
Learning efficient object detection models with knowledge distillation. In Proc. of
NeurIPS, 2017.
[34] Hanlin Chen, Baochang Zhang, Song Xue, Xuan Gong, Hong Liu, Rongrong Ji, and
David Doermann. Anti-bandit neural architecture search for model defense. In Com-
puter Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part XIII 16, pages 70–85, 2020.
[35] Hanlin Chen, Li’an Zhuo, Baochang Zhang, Xiawu Zheng, Jianzhuang Liu, Rongrong
Ji, David Doermann, and Guodong Guo. Binarized neural architecture search for
efficient object recognition. International Journal of Computer Vision, 129:501–516,
2021.
[36] Hanlin Chen, Li’an Zhuo, Baochang Zhang, Xiawu Zheng, Jianzhuang Liu, Rongrong
Ji, David Doermann, and Guodong Guo. Binarized neural architecture search for
efficient object recognition. International Journal of Computer Vision, 129(2):501–
516, 2021.
[37] Mingzhe Chen, Ursula Challita, Walid Saad, Changchuan Yin, and Mérouane Debbah.
Artificial neural networks-based machine learning for wireless networks: A tutorial.
IEEE Communications Surveys & Tutorials, 21(4):3039–3071, 2019.
[38] Shangyu Chen, Wenya Wang, and Sinno Jialin Pan. Metaquant: Learning to quantize
by learning to penetrate non-differentiable quantization. Proc. of NeurIPS, 32:3916–
3926, 2019.
[39] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture
search: Bridging the depth gap between search and evaluation. In Proceedings of the
IEEE/CVF international conference on computer vision, pages 1294–1303, 2019.
[40] Yankai Chen, Yifei Zhang, Huifeng Guo, Ruiming Tang, and Irwin King. An effective
post-training embedding binarization approach for fast online top-k passage matching.
In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association
for Computational Linguistics and the 12th International Joint Conference on Natural
Language Processing, pages 102–108, 2022.
[41] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. Person
re-identification by multi-channel parts-based cnn with improved triplet loss function.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 1335–1344, 2016.
[42] Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, and Daniel
Soudry. Neural gradients are near-lognormal: improved quantized and sparse training.
arXiv preprint arXiv:2006.08173, 2020.
[43] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijay-
alakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping ac-
tivation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
[44] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul
Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-
image translation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 8789–8797, 2018.
[45] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What
does bert look at? An analysis of bert’s attention. arXiv preprint arXiv:1906.04341,
2019.
[46] Benoı̂t Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimiza-
tion. Annals of operations research, 153(1):235–256, 2007.
[47] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural
networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
[48] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Train-
ing deep neural networks with binary weights during propagations. Advances in neural
information processing systems, 28, 2015.
[49] Richard Crandall and Carl Pomerance. Prime numbers. Springer, 2001.
[51] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz
Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
[52] Alessio Del Bue, Joao Xavier, Lourdes Agapito, and Marco Paladini. Bilinear modeling
via augmented lagrange multipliers (balm). IEEE transactions on pattern analysis and
machine intelligence, 34(8):1496–1508, 2011.
[53] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A
large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[54] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
[55] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. In NAACL-
HLT, 2019.
[56] Ruizhou Ding, Ting-Wu Chin, Zeye Liu, and Diana Marculescu. Regularizing ac-
tivation distribution for training binarized deep networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11408–
11417, 2019.
[57] Ruizhou Ding, Zeye Liu, Rongye Shi, Diana Marculescu, and RD Blanton. Lightnn:
Filling the gap between conventional deep neural networks and binarized networks.
In Proceedings of the on Great Lakes Symposium on VLSI 2017, pages 35–40, 2017.
[58] Paul Adrien Maurice Dirac. The physical interpretation of the quantum dynamics.
Proceedings of the Royal Society of London. Series A, Containing Papers of a Math-
ematical and Physical Character, 113(765):621–641, 1927.
[59] Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and
Kurt Keutzer. Hawq-v2: Hessian aware trace-weighted quantization of neural net-
works. In Neural Information Processing Systems(NeurIPS), pages 18518–18529,
2020.
[60] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold,
Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recogni-
tion at scale. arXiv preprint arXiv:2010.11929, 2020.
[64] Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on
demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019.
[65] Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Herve
Jegou, and Armand Joulin. Training with quantization noise for extreme model com-
pression. arXiv preprint arXiv:2004.07320, 2020.
[66] Pedro Felzenszwalb and Ramin Zabih. Discrete optimization algorithms in computer
vision. Tutorial at IEEE International Conference on Computer Vision, 2007.
[67] Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm.
In icml, volume 96, pages 148–156. Citeseer, 1996.
[68] D. Gabor. Theory of communication. Journal of the Institution of Electrical Engineers – Part III: Radio and Communication Engineering, 1946.
[71] Yixiao Ge, Zhuowan Li, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, et al.
Fd-gan: Pose-guided feature distilling gan for robust person re-identification. In Pro-
ceedings of the European Conference on Computer Vision, pages 1222–1233, 2018.
[72] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on
computer vision, pages 1440–1448, 2015.
[73] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierar-
chies for accurate object detection and semantic segmentation. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
[74] Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin,
Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision
and low-bit neural networks. In Proceedings of the IEEE/CVF International Confer-
ence on Computer Vision, pages 4852–4861, 2019.
[76] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Proceedings of the European Conference on Computer Vision, pages 2672–2680, 2014.
[77] Jiaxin Gu, Ce Li, Baochang Zhang, Jungong Han, Xianbin Cao, Jianzhuang Liu, and
David Doermann. Projection convolutional neural networks for 1-bit cnns via discrete
back propagation. In Proceedings of the AAAI Conference on Artificial Intelligence,
2019.
[78] Jiaxin Gu, Junhe Zhao, Xiaolong Jiang, Baochang Zhang, Jianzhuang Liu, Guodong
Guo, and Rongrong Ji. Bayesian optimized 1-bit cnns. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 4909–4917, 2019.
[79] Jianyuan Guo, Kai Han, Yunhe Wang, Han Wu, Xinghao Chen, Chunjing Xu, and
Chang Xu. Distilling object detectors via decoupled features. In Proc. of CVPR,
2021.
[80] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep
learning with limited numerical precision. In International conference on machine
learning, pages 1737–1746. PMLR, 2015.
[81] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint
arXiv:1609.09106, 2016.
[82] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The ele-
ments of statistical learning: data mining, inference and prediction. The Mathematical
Intelligencer, 27(2):83–85, 2005.
[83] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick.
Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377,
2021.
[84] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 770–778, 2016.
[85] Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng,
and Roeland Nusselder. Latent weights do not exist: Rethinking binarized neural
network optimization. Advances in neural information processing systems, 32, 2019.
[86] Pedro Hermosilla, Tobias Ritschel, and Timo Ropinski. Total denoising: Unsupervised
learning of 3d point cloud cleaning. In Proceedings of the IEEE/CVF international
conference on computer vision, pages 52–60, 2019.
[87] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural
network. Computer Science, 14(7):38–39, 2015.
[88] Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert:
Dynamic bert with adaptive width and depth. Advances in Neural Information Pro-
cessing Systems, 33:9782–9793, 2020.
[89] Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert:
Dynamic bert with adaptive width and depth. In NeurIPs, 2020.
[90] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang,
Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient
convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
[91] Qinghao Hu, Peisong Wang, and Jian Cheng. From hashing to cnns: Training binary
weight networks via hashing. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 32, 2018.
[92] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely
connected convolutional networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[93] Kun Huang, Bingbing Ni, and Xiaokang Yang. Efficient quantization for neural net-
works with binary weights and low bitwidth activations. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 33, pages 3854–3861, 2019.
[94] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity bench-
mark for generic object tracking in the wild. IEEE transactions on pattern analysis
and machine intelligence, 43(5):1562–1577, 2019.
[95] Yan Huang, Jingsong Xu, Qiang Wu, Zhedong Zheng, Zhaoxiang Zhang, and Jian
Zhang. Multi-pseudo regularized label for generated data in person re-identification.
IEEE Transactions on Image Processing, 28(3):1391–1403, 2018.
[96] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen,
HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient train-
ing of giant neural networks using pipeline parallelism. Advances in neural information
processing systems, 32, 2019.
[97] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural
networks. In Proc. of ECCV, pages 304–320, 2018.
[98] Zhiqi Huang, Lu Hou, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Ghostbert:
Generate more features with cheap operations for bert. In Proceedings of the 59th
Annual Meeting of the Association for Computational Linguistics and the 11th Inter-
national Joint Conference on Natural Language Processing (Volume 1: Long Papers),
pages 6512–6523, 2021.
[99] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Ben-
gio. Binarized neural networks. Advances in neural information processing systems,
29, 2016.
[100] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Ben-
gio. Quantized neural networks: Training neural networks with low precision weights
and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
[101] Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Improving
post training neural quantization: Layer-wise calibration and integer programming.
arXiv preprint arXiv:2006.10518, 2020.
[102] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In Proceedings of International conference
on machine learning, pages 448–456, 2015.
[103] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image trans-
lation with conditional adversarial networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[104] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew
Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of
neural networks for efficient integer-arithmetic-only inference. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018.
[105] Tianchu Ji, Shraddhan Jain, Michael Ferdman, Peter Milder, H Andrew Schwartz,
and Niranjan Balasubramanian. On the distribution, sparsity, and inference-time
quantization of attention values in transformers. arXiv preprint arXiv:2106.01335,
2021.
[106] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu. Tinybert:
Distilling bert for natural language understanding. In Findings of Empirical Methods
in Natural Language Processing, 2020.
[107] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang,
and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv
preprint arXiv:1909.10351, 2019.
[108] Amin Jourabloo and Xiaoming Liu. Pose-invariant 3d face alignment. In Proceedings
of the IEEE international conference on computer vision, pages 3694–3702, 2015.
[109] Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convo-
lutional neural networks. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 19–28, 2017.
[110] Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Youngjun Kwak, Jae-Joon
Han, and Changkyu Choi. Joint training of low-precision neural network with quan-
tization interval parameters. arXiv preprint arXiv:1808.05779, 2, 2018.
[111] Mahdi M Kalayeh, Emrah Basaran, Muhittin Gökmen, Mustafa E Kamasak, and
Mubarak Shah. Human semantic parsing for person re-identification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1062–
1071, 2018.
[112] Mohammad Emtiyaz Khan and Haavard Rue. Learning algorithms from Bayesian principles. arXiv preprint arXiv:2002.10778, 2(4), 2020.
[113] Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search
via contextualized late interaction over bert. In Proceedings of the 43rd International
ACM SIGIR conference on research and development in Information Retrieval, pages
39–48, 2020.
[114] Dahyun Kim, Kunal Pratap Singh, and Jonghyun Choi. Learning architectures for
binary networks. In Proc. of ECCV, pages 575–591, 2020.
[115] Hyungjun Kim, Kyungsu Kim, Jinseok Kim, and Jae-Joon Kim. Binaryduo: Reducing
gradient mismatch in binary activation network by coupling binary activations. In
International Conference on Learning Representations.
[116] Jangho Kim, Yash Bhalgat, Jinwon Lee, Chirag Patel, and Nojun Kwak. Qkd:
Quantization-aware knowledge distillation. arXiv preprint arXiv:1911.12491, 2019.
[117] Minje Kim and Paris Smaragdis. Bitwise neural networks. arXiv preprint
arXiv:1601.06071, 2016.
[118] Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. I-
bert: Integer-only bert quantization. In International conference on machine learning,
pages 5506–5518. PMLR, 2021.
[119] Seungryong Kim, Dongbo Min, Stephen Lin, and Kwanghoon Sohn. Dctm: Discrete-
continuous transformation matching for semantic flow. In Proceedings of the IEEE
International Conference on Computer Vision, volume 6, 2017.
[120] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local
reparameterization trick. Proceedings of the Advances in neural information processing
systems, pages 2575–2583, 2015.
[121] Martin Koestinger, Paul Wohlhart, Peter M Roth, and Horst Bischof. Annotated
facial landmarks in the wild: A large-scale, real-world database for facial landmark
localization. In 2011 IEEE international conference on computer vision workshops
(ICCV workshops), pages 2144–2151. IEEE, 2011.
[122] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from
tiny images. 2009.
[123] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In Proceedings of the Advances in Neural Infor-
mation Processing Systems, pages 1097–1105, 2012.
[124] Jouko Lampinen and Aki Vehtari. Bayesian approach for neural networks—review
and case studies. Neural networks, 14(3):257–274, 2001.
[125] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. Albert: A lite bert for self-supervised learning of language repre-
sentations. arXiv preprint arXiv:1909.11942, 2019.
[126] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. Albert: A lite bert for self-supervised learning of language repre-
sentations. In ICLR, 2020.
[127] Emanuel Laude, Jan-Hendrik Lange, Jonas Schüpfer, Csaba Domokos, Laura
Leal-Taixé, Frank R. Schmidt, Bjoern Andres, and Daniel Cremers. Discrete-continuous
admm for transductive inference in higher-order mrfs. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4539–
4548, 2018.
[128] Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit
neural network: Squeeze the last bit out with admm. In Proceedings of the AAAI
Conference on Artificial Intelligence, pages 3466–3473, 2018.
[129] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-
detr: Accelerate detr training by introducing query denoising. In Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition, pages 13619–
13627, 2022.
[130] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint
arXiv:1605.04711, 2016.
[131] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication effi-
cient distributed machine learning with the parameter server. Advances in Neural
Information Processing Systems, 27, 2014.
[132] Wei Li, Xiatian Zhu, and Shaogang Gong. Person re-identification by deep joint learn-
ing of multi-loss classification. In Proceedings of the International Joint Conference
on Artificial Intelligence, pages 2194–2200, 2017.
[133] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Factorized bilinear models
for image recognition. In Proc. of ICCV, pages 2079–2087, 2017.
[134] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen.
Pointcnn: Convolution on x-transformed points. In Proceedings of Advances in Neural
Information Processing Systems, pages 820–830, 2018.
[135] Yanjing Li, Sheng Xu, Xianbin Cao, Li’an Zhuo, Baochang Zhang, Tian Wang, and
Guodong Guo. Dcp–nas: Discrepant child–parent neural architecture search for 1-bit
cnns. International Journal of Computer Vision, pages 1–23, 2023.
[136] Yanjing Li, Sheng Xu, Baochang Zhang, Xianbin Cao, Peng Gao, and Guodong Guo.
Q-vit: Accurate and fully quantized low-bit vision transformer. In Advances in neural
information processing systems, 2022.
[137] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei
Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block
reconstruction. arXiv preprint arXiv:2102.05426, 2021.
[138] Zefan Li, Bingbing Ni, Wenjun Zhang, Xiaokang Yang, and Wen Gao. Performance
guaranteed network acceleration via high-order residual quantization. In Proceedings
of the IEEE International Conference on Computer Vision, pages 2584–2592, 2017.
[139] Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of
drug sensitive genes. Journal of the American Statistical Association, 113(523):955–
972, 2018.
[140] Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Yan Wang, Yongjian Wu,
Feiyue Huang, and Chia-Wen Lin. Rotated binary neural network. In Proc. of
NeurIPS, pages 1–9, 2020.
[141] Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang
Zhang. Accelerating convolutional networks via global & dynamic filter pruning.
In Proceedings of the International Joint Conference on Artificial Intelligence, pages
2425–2432, 2018.
[142] Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang
Ye, Feiyue Huang, and David Doermann. Towards optimal structured cnn pruning
via generative adversarial learning. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 2790–2799, 2019.
[143] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and
Serge Belongie. Feature pyramid networks for object detection. In Proceedings of
IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[144] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss
for dense object detection. In Proceedings of the IEEE international conference on
computer vision, pages 2980–2988, 2017.
[145] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ra-
manan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in
context. In Proceedings of the European Conference on Computer Vision, pages 740–
755, 2014.
[146] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for
fine-grained visual recognition. In Proc. of ICCV, pages 1449–1457, 2015.
[147] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural
network. In Proceedings of the Advances in Neural Information Processing Systems,
pages 345–353, 2017.
[148] Chunlei Liu, Wenrui Ding, Yuan Hu, Baochang Zhang, Jianzhuang Liu, Guodong
Guo, and David Doermann. Rectified binary convolutional networks with generative
adversarial learning. International Journal of Computer Vision, 129:998–1012, 2021.
[149] Chunlei Liu, Wenrui Ding, Xin Xia, Baochang Zhang, Jiaxin Gu, Jianzhuang Liu,
Rongrong Ji, and David Doermann. Circulant binary convolutional networks: Enhanc-
ing the performance of 1-bit dcnns with circulant back propagation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
2691–2699, 2019.
[150] Chunlei Liu, Wenrui Ding, Xin Xia, Baochang Zhang, Jiaxin Gu, Jianzhuang Liu,
Rongrong Ji, and David Doermann. Circulant binary convolutional networks: Enhanc-
ing the performance of 1-bit dcnns with circulant back propagation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
2691–2699, 2019.
[151] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture
search. In Proceedings of the International Conference on Learning Representations,
pages 1–13, 2019.
[152] Risheng Liu, Jiaxin Gao, Jin Zhang, Deyu Meng, and Zhouchen Lin. Investigating
bi-level optimization for learning and vision from a unified perspective: A survey and
beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[153] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-
Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Proceedings
of the European Conference on Computer Vision, pages 21–37, 2016.
[154] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin,
and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted
windows. In Proc. of ICCV, pages 10012–10022, 2021.
[155] Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric P Xing, and Zhiqiang Shen.
Nonuniform-to-uniform quantization: Towards accurate quantization via generalized
straight-through estimation. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 4942–4952, 2022.
[156] Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghura-
man Krishnamoorthi, and Yashar Mehdad. Bit: Robustly binarized multi-distilled
transformer. In Advances In Neural Information Processing Systems, 2022.
[157] Zechun Liu, Zhiqiang Shen, Shichao Li, Koen Helwegen, Dong Huang, and Kwang-
Ting Cheng. How do adam and training strategies help bnns optimization. In Proc.
of ICML, pages 6936–6946, 2021.
[158] Zechun Liu, Zhiqiang Shen, Marios Savvides, and Kwang-Ting Cheng. Reactnet:
Towards precise binary neural network with generalized activation functions. In Pro-
ceedings of the European Conference on Computer Vision, pages 143–159, 2020.
[159] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng.
Bi-real net: Enhancing the performance of 1-bit cnns with improved representational
capability and advanced training algorithm. In Proceedings of the European Confer-
ence on Computer Vision, pages 747–763, 2018.
[160] Zhen-Tao Liu, Si-Han Li, Min Wu, Wei-Hua Cao, Man Hao, and Lin-Bo Xian. Eye
localization based on weight binarization cascade convolution neural network. Neu-
rocomputing, 378:45–53, 2020.
[161] Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. Post-
training quantization for vision transformer. Advances in Neural Information Pro-
cessing Systems, 34:28092–28103, 2021.
[162] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui
Zhang. Learning efficient convolutional networks through network slimming. In Pro-
ceedings of the IEEE International Conference on Computer Vision, pages 2736–2744,
2017.
[163] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks
for semantic segmentation. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 3431–3440, 2015.
[164] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Pro-
ceedings of the International Conference on Learning Representations, pages 1–18,
2017.
[165] Ziyang Luo, Artur Kulmizev, and Xiaoxi Mao. Positional artefacts propagate through
masked language model embeddings. arXiv preprint arXiv:2011.04393, 2020.
[166] X. Ma, P. Zhang, S. Zhang, N. Duan, Y. Hou, D. Song, and M. Zhou. A tensorized
transformer for language modeling. In Advances in Neural Information Processing
Systems, 2019.
[167] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning
models resistant to adversarial attacks. In ICLR, 2017.
[168] Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos. Train-
ing binary neural networks with real-to-binary convolutions. arXiv preprint
arXiv:2003.11535, 2020.
[169] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei
Sun, and Jingdong Wang. Conditional detr for fast training convergence. In Proceed-
ings of the IEEE/CVF International Conference on Computer Vision, pages 3651–
3660, 2021.
[170] Xiangming Meng, Roman Bachmann, and Mohammad Emtiyaz Khan. Training bi-
nary neural networks using the bayesian learning rule. In International conference on
machine learning, pages 6852–6861. PMLR, 2020.
[171] D Messerschmitt. Quantizing for maximum output entropy (corresp.). IEEE Trans-
actions on Information Theory, 17(5):612–612, 1971.
[172] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than
one? Advances in neural information processing systems, 32, 2019.
[173] Luca Mocerino and Andrea Calimera. Tentaclenet: A pseudo-ensemble template for
accurate binary convolutional neural networks. In 2020 2nd IEEE International Con-
ference on Artificial Intelligence Circuits and Systems (AICAS), pages 261–265. IEEE,
2020.
[174] Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of bayesian
methods for seeking the extremum. Towards global optimization, 2:117–129, 1978.
[175] Todd K Moon. The expectation-maximization algorithm. IEEE Signal processing
magazine, 13(6):47–60, 1996.
[176] Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la
Société mathématique de France, 93:273–299, 1965.
[177] Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for
uav tracking. In Computer Vision–ECCV 2016: 14th European Conference, Amster-
dam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 445–461.
Springer, 2016.
[178] Prasanna Kumar Muthukumar and Alan W Black. A deep learning approach to data-
driven parameterizations for statistical parametric speech synthesis. arXiv preprint
arXiv:1409.8558, 2014.
[179] Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen
Blankevoort. Up or down? adaptive rounding for post-training quantization. In In-
ternational Conference on Machine Learning, pages 7197–7206. PMLR, 2020.
[180] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free
quantization through weight equalization and bias correction. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 1325–1334, 2019.
[181] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y
Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[182] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Ma-
jumder, and Li Deng. Ms marco: A human generated machine reading comprehension
dataset. In CoCo@NIPS, 2016.
[183] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals,
Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet:
A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[184] Nikunj C Oza and Stuart J Russell. Online bagging and boosting. In International
Workshop on Artificial Intelligence and Statistics, pages 229–236. PMLR, 2001.
[185] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary
DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic
differentiation in pytorch. In Proceedings of the Advances in Neural Information
Processing Systems Workshops, pages 1–4, 2017.
[186] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory
Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch:
An imperative style, high-performance deep learning library. In Advances in Neural
Information Processing Systems, pages 8026–8037, 2019.
[187] KB Petersen, MS Pedersen, et al. The matrix cookbook. Technical University of
Denmark, 15, 2008.
[188] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural
architecture search via parameters sharing. In International conference on machine
learning, pages 4095–4104. PMLR, 2018.
[189] Hai Phan, Zechun Liu, Dang Huynh, Marios Savvides, Kwang-Ting Cheng, and
Zhiqiang Shen. Binarizing mobilenet via evolution-based searching. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
13420–13429, 2020.
[190] Gabriele Prato, Ella Charlaix, and Mehdi Rezagholizadeh. Fully quantized trans-
former for machine translation. arXiv preprint arXiv:1910.10485, 2019.
[191] Juan C. Pérez, Motasem Alfarra, Guillaume Jeanneret, Adel Bibi, Ali Kassem Thabet,
Bernard Ghanem, and Pablo Arbeláez. Robust gabor networks. arXiv, 2019.
[192] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning
on point sets for 3d classification and segmentation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[193] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep
hierarchical feature learning on point sets in a metric space. In Proceedings of Advances
in Neural Information Processing Systems, pages 5099–5108, 2017.
[194] Haotong Qin, Zhongang Cai, Mingyuan Zhang, Yifu Ding, Haiyu Zhao, Shuai Yi,
Xianglong Liu, and Hao Su. Bipointnet: Binary neural network for point clouds. In
Proceedings of the International Conference on Learning Representations, 2021.
[195] Haotong Qin, Yifu Ding, Mingyuan Zhang, Qinghua Yan, Aishan Liu, Qingqing Dang,
Ziwei Liu, and Xianglong Liu. Bibert: Accurate fully binarized bert. arXiv preprint
arXiv:2203.06390, 2022.
[196] Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu,
and Jingkuan Song. Forward and backward information retention for accurate binary
neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 2250–2259, 2020.
[197] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey
Dosovitskiy. Do vision transformers see like convolutional neural networks? Advances
in Neural Information Processing Systems, 34:12116–12128, 2021.
[198] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad:
100,000+ questions for machine comprehension of text. arXiv preprint
arXiv:1606.05250, 2016.
[199] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net:
Imagenet classification using binary convolutional neural networks. In Proceedings of
the European Conference on Computer Vision, pages 525–542, 2016.
[200] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint
arXiv:1804.02767, 2018.
[201] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-
time object detection with region proposal networks. In Proceedings of the Advances
in Neural Information Processing Systems, pages 91–99, 2015.
[202] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and
Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding
box regression. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 658–666, 2019.
[203] Ergys Ristani and Carlo Tomasi. Features for multi-target multi-camera tracking and
re-identification. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6036–6046, 2018.
[204] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet
large scale visual recognition challenge. International journal of computer vision,
115:211–252, 2015.
[205] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh
Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
[206] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert,
a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint
arXiv:1910.01108, 2019.
[207] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How
does batch normalization help optimization? In Proceedings of Advances in neural
information processing systems, pages 1–11, 2018.
[208] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer.
Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the
AAAI Conference on Artificial Intelligence, 2020.
[209] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W
Mahoney, and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization
of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34,
pages 8815–8821, 2020.
[210] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks
via information. arXiv preprint arXiv:1703.00810, 2017.
[211] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-
scale image recognition. In Proceedings of the International Conference on Learning
Representations, pages 1–15, 2015.
[212] Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Pose-
driven deep convolutional model for person re-identification. In Proceedings of the
IEEE International Conference on Computer Vision, pages 3960–3969, 2017.
[213] Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Deep attributes driven
multi-camera person re-identification. In Proceedings of the European Conference on
Computer Vision, pages 475–491, 2016.
[214] Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. Part-aligned
bilinear representations for person re-identification. In Proceedings of the European
Conference on Computer Vision, pages 402–419, 2018.
[215] Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight
uncertainty in bayesian neural networks. In Proceedings of the Artificial Intelligence
and Statistics, pages 1283–1292, 2017.
[216] Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational
bayesian neural networks. In Proceedings of the International Conference on Learning
Representations, pages 1–22, 2019.
[217] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for
bert model compression. arXiv preprint arXiv:1908.09355, 2019.
[218] Siyang Sun, Yingjie Yin, Xingang Wang, De Xu, Wenqi Wu, and Qingyi Gu. Fast ob-
ject detection based on binary deep convolution neural networks. CAAI transactions
on intelligence technology, 3(4):191–197, 2018.
[219] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models:
Person retrieval with refined part pooling (and a strong convolutional baseline). In
Proceedings of the European Conference on Computer Vision, pages 480–496, 2018.
[220] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going
deeper with convolutions. In Proceedings of the IEEE conference on computer vi-
sion and pattern recognition, pages 1–9, 2015.
[221] Chaofan Tao, Lu Hou, Wei Zhang, Lifeng Shang, Xin Jiang, Qun Liu, Ping Luo, and
Ngai Wong. Compression of generative pre-trained language models via quantization.
arXiv preprint arXiv:2203.10705, 2022.
[222] Jiayi Tian, Chao Fang, Haonan Wang, and Zhongfeng Wang. Bebert: Efficient and
robust binary ensemble bert. arXiv preprint arXiv:2210.15976, 2022.
[223] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck
method. arXiv preprint physics/0004057, 2000.
[224] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablay-
rolles, and Hervé Jégou. Training data-efficient image transformers & distillation
through attention. In International conference on machine learning, pages 10347–
10357. PMLR, 2021.
[226] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proc.
of ICCV, pages 1365–1374, 2019.
[227] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
neural information processing systems, 30, 2017.
[228] Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng Tao Shen.
Tbn: Convolutional neural network with ternary inputs and binary weights. In Pro-
ceedings of the European Conference on Computer Vision (ECCV), pages 315–332,
2018.
[229] Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng Tao Shen.
Tbn: Convolutional neural network with ternary inputs and binary weights. In Pro-
ceedings of the European Conference on Computer Vision, pages 315–332, 2018.
[230] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R
Bowman. Glue: A multi-task benchmark and analysis platform for natural language
understanding. arXiv preprint arXiv:1804.07461, 2018.
[231] Guo-Hua Wang, Yifan Ge, and Jianxin Wu. Distilling knowledge by mimicking fea-
tures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[232] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. Transferable joint attribute-
identity deep learning for unsupervised person re-identification. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 2275–2284,
2018.
[233] Peisong Wang, Qinghao Hu, Yifan Zhang, Chunjie Zhang, Yang Liu, and Jian Cheng.
Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Con-
ference on computer vision and pattern recognition, pages 4376–4384, 2018.
[234] Song Wang, Dongchun Ren, Li Chen, Wei Fan, Jun Sun, and Satoshi Naoi. On
study of the binarized deep neural network for image classification. arXiv preprint
arXiv:1602.07373, 2016.
[235] Tao Wang, Li Yuan, Xiaopeng Zhang, and Jiashi Feng. Distilling object detectors
with fine-grained feature imitation. In Proc. of CVPR, 2019.
[236] Xiaodi Wang, Baochang Zhang, Ce Li, Rongrong Ji, Jungong Han, Xianbin Cao,
and Jianzhuang Liu. Modulated convolutional networks. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 840–848, 2018.
[237] Xiaodi Wang, Baochang Zhang, Ce Li, Rongrong Ji, Jungong Han, Xianbin Cao, and
Jianzhuang Liu. Modulated convolutional networks. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 840–848, 2018.
[238] Yulin Wang, Zanlin Ni, Shiji Song, Le Yang, and Gao Huang. Revisiting locally
supervised learning: an alternative to end-to-end training. In Proceedings of the In-
ternational Conference on Learning Representations, pages 1–21, 2021.
[239] Ziwei Wang, Jiwen Lu, Chenxin Tao, Jie Zhou, and Qi Tian. Learning channel-
wise interactions for binary convolutional neural networks. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition, pages 568–577,
2019.
[240] Ziwei Wang, Ziyi Wu, Jiwen Lu, and Jie Zhou. Bidet: An efficient binarized object
detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 2049–2058, 2020.
[241] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge
domain gap for person re-identification. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 79–88, 2018.
[242] Xing Wei, Yue Zhang, Yihong Gong, Jiawei Zhang, and Nanning Zheng. Grassmann
pooling as compact homogeneous bilinear pooling for fine-grained visual classification.
In Proceedings of the European Conference on Computer Vision (ECCV), pages 355–
370, 2018.
[243] Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang,
Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of
low-bit transformer language models. arXiv preprint arXiv:2209.13325, 2022.
[244] Liangjian Wen, Xuanyang Zhang, Haoli Bai, and Zenglin Xu. Structured pruning of
recurrent neural networks through neuron selection. Neural Networks, 123:134–141,
2020.
[245] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature
learning approach for deep face recognition. In Proceedings of the European Conference
on Computer Vision, pages 499–515, 2016.
[246] Ronald J Williams and David Zipser. A learning algorithm for continually running
fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
[247] Eric Wong, Leslie Rice, and J. Zico Kolter. Fast is better than free: Revisiting adver-
sarial training. In ICLR, 2020.
[248] Lin Wu, Yang Wang, Junbin Gao, and Xue Li. Where-and-when to look: Deep siamese
attention networks for video-based person re-identification. IEEE Transactions on
Multimedia, 21(6):1412–1424, 2018.
[249] Nailong Wu. The maximum entropy method, volume 32. Springer Science & Business
Media, 2012.
[250] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark.
In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 2411–2418, 2013.
[251] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
[252] Xu Xiang, Yanmin Qian, and Kai Yu. Binary deep neural networks for speech recog-
nition. In INTERSPEECH, pages 533–537, 2017.
[253] C. Xie, Y. Wu, L. V. D. Maaten, A. L. Yuille, and K. He. Feature denoising for
improving adversarial robustness. In CVPR, 2019.
[254] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural archi-
tecture search. arXiv preprint arXiv:1812.09926, 2018.
[255] Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. Deebert: Dynamic
early exiting for accelerating bert inference. arXiv preprint arXiv:2004.12993, 2020.
[256] Qiangeng Xu, Xudong Sun, Cho-Ying Wu, Panqu Wang, and Ulrich Neumann. Grid-
gcn for fast and scalable point cloud learning. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 5661–5670, 2020.
[257] Sheng Xu, Yanjing Li, Mingbao Lin, Peng Gao, Guodong Guo, Jinhu Lü, and
Baochang Zhang. Q-detr: An efficient low-bit quantized detection transformer. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion, pages 3842–3851, 2023.
[258] Sheng Xu, Yanjing Li, Teli Ma, Mingbao Lin, Hao Dong, Baochang Zhang, Peng
Gao, and Jinhu Lu. Resilient binary neural network. In Proceedings of the AAAI
Conference on Artificial Intelligence, pages 10620–10628, 2023.
[259] Sheng Xu, Yanjing Li, Tiancheng Wang, Teli Ma, Baochang Zhang, Peng Gao,
Yu Qiao, Jinhu Lü, and Guodong Guo. Recurrent bilinear optimization for binary
neural networks. In European Conference on Computer Vision, pages 19–35. Springer,
2022.
[260] Sheng Xu, Yanjing Li, Bohan Zeng, Teli Ma, Baochang Zhang, Xianbin Cao, Peng
Gao, and Jinhu Lü. Ida-det: An information discrepancy-aware distillation for 1-bit
detectors. In European Conference on Computer Vision, pages 346–361. Springer,
2022.
[261] Sheng Xu, Yanjing Li, Junhe Zhao, Baochang Zhang, and Guodong Guo. Poem: 1-
bit point-wise operations based on expectation-maximization for efficient point cloud
processing. In Proceedings of the British Machine Vision Conference, 2021.
[262] Sheng Xu, Chang Liu, Baochang Zhang, Jinhu Lü, Guodong Guo, and David Doer-
mann. Bire-id: Binary neural network for efficient person re-id. ACM Transactions
on Multimedia Computing, Communications, and Applications (TOMM), 18(1s):1–22,
2022.
[263] Sheng Xu, Zhendong Liu, Xuan Gong, Chunlei Liu, Mingyuan Mao, and Baochang
Zhang. Amplitude suppression and direction activation in networks for 1-bit faster
r-cnn. In Proceedings of the 4th International Workshop on Embedded and Mobile
Deep Learning, pages 19–24, 2020.
[264] Sheng Xu, Junhe Zhao, Jinhu Lu, Baochang Zhang, Shumin Han, and David Doer-
mann. Layer-wise searching for 1-bit detectors. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 5682–5691, 2021.
[265] Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai
Xiong. Pc-darts: Partial channel connections for memory-efficient architecture search.
arXiv preprint arXiv:1907.05737, 2019.
[266] Zhe Xu and Ray CC Cheung. Accurate and compact convolutional neural networks
with trained binarization. arXiv preprint arXiv:1909.11366, 2019.
[267] Zihan Xu, Mingbao Lin, Jianzhuang Liu, Jie Chen, Ling Shao, Yue Gao, Yonghong
Tian, and Rongrong Ji. Recu: Reviving the dead weights in binary neural networks.
arXiv preprint arXiv:2103.12369, 2021.
[268] Haojin Yang, Martin Fritzsche, Christian Bartz, and Christoph Meinel. Bmxnet: An
open-source binary neural network implementation based on mxnet. In Proceedings
of the 25th ACM international conference on Multimedia, pages 1209–1212, 2017.
[269] Li Yang, Zhezhi He, and Deliang Fan. Binarized depthwise separable neural network
for object tracking in fpga. In Proceedings of the 2019 on Great Lakes Symposium on
VLSI, pages 347–350, 2019.
[270] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. Hessian-
based analysis of large batch training and robustness to adversaries. Advances in
Neural Information Processing Systems, 31, 2018.
[271] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Deep metric learning for person re-
identification. In Proceedings of the International Conference on Pattern Recognition,
pages 34–39, 2014.
[272] Penghang Yin, Shuai Zhang, Jiancheng Lyu, Stanley Osher, Yingyong Qi, and Jack
Xin. Binaryrelax: A relaxation approach for training deep neural networks with quan-
tized weights. SIAM Journal on Imaging Sciences, 11(4):2205–2223, 2018.
[273] Shouyi Yin, Peng Ouyang, Shixuan Zheng, Dandan Song, Xiudong Li, Leibo Liu, and
Shaojun Wei. A 141 uW, 2.46 pJ/neuron binarized convolutional neural network based
self-learning speech recognition processor in 28nm cmos. In 2018 IEEE Symposium
on VLSI Circuits, pages 139–140. IEEE, 2018.
[274] C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter. Nas-bench-
101: Towards reproducible neural architecture search. In ICML, 2019.
[275] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojana-
palli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch
optimization for deep learning: Training bert in 76 minutes. Proc. of ICLR, pages
1–37, 2020.
[276] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. Unit-
box: An advanced object detection network. In Proceedings of the 24th ACM inter-
national conference on Multimedia, pages 516–520, 2016.
[277] Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salz-
mann. Evaluating the search phase of neural architecture search. arXiv preprint
arXiv:1902.08142, 2019.
[278] Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. Multi-modal factorized bilinear
pooling with co-attention learning for visual question answering. In Proc. of ICCV,
pages 1821–1830, 2017.
[279] Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, and Andreas Moshovos. Gobo:
Quantizing attention-based nlp models for low latency and energy efficient inference.
In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MI-
CRO), pages 811–824. IEEE, 2020.
[280] Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8bert: Quantized
8bit bert. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cogni-
tive Computing-NeurIPS Edition (EMC2-NIPS), pages 36–39. IEEE, 2019.
[281] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of
the British Machine Vision Conference, pages 1–15, 2016.
[282] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
[283] Baochang Zhang, Alessandro Perina, Zhigang Li, Vittorio Murino, Jianzhuang Liu,
and Rongrong Ji. Bounding multiple gaussians uncertainty with application to object
tracking. International journal of computer vision, 118:364–379, 2016.
[284] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned
quantization for highly accurate and compact deep neural networks. In Proceedings
of the European conference on computer vision (ECCV), pages 365–382, 2018.
[285] Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, and Qun Liu.
Ternarybert: Distillation-aware ultra-low bit bert. arXiv preprint arXiv:2009.12812,
2020.
[286] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An ex-
tremely efficient convolutional neural network for mobile devices. arXiv preprint
arXiv:1707.01083, 2017.
[287] Junhe Zhao, Sheng Xu, Baochang Zhang, Jiaxin Gu, David Doermann, and Guodong
Guo. Towards compact 1-bit cnns via bayesian learning. International Journal of
Computer Vision, pages 1–25, 2022.
[288] Feng Zheng, Cheng Deng, and Heng Huang. Binarized neural networks for resource-
efficient hashing with minimizing quantization loss. In IJCAI, pages 1032–1040, 2019.
[289] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian.
Scalable person re-identification: A benchmark. Proceedings of the IEEE International
Conference on Computer Vision, pages 1116–1124, 2015.
[290] Shixuan Zheng, Peng Ouyang, Dandan Song, Xiudong Li, Leibo Liu, Shaojun Wei, and
Shouyi Yin. An ultra-low power binarized convolutional neural network-based speech
recognition processor with on-chip self-learning. IEEE Transactions on Circuits and
Systems I: Regular Papers, 66(12):4648–4661, 2019.
[291] Xiawu Zheng, Rongrong Ji, Lang Tang, Yan Wan, Baochang Zhang, Yongjian Wu,
Yunsheng Wu, and Ling Shao. Dynamic distribution pruning for efficient network
architecture search. arXiv preprint arXiv:1905.13543, 2019.
[292] Xiawu Zheng, Rongrong Ji, Lang Tang, Baochang Zhang, Jianzhuang Liu, and
Qi Tian. Multinomial distribution learning for effective neural architecture search.
In Proc. of ICCV, 2019.
[293] Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren.
Distance-iou loss: Faster and better learning for bounding box regression. In Proceed-
ings of the AAAI conference on artificial intelligence, volume 34, pages 12993–13000,
2020.
[294] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-
identification baseline in vitro. In Proceedings of the IEEE International Conference
on Computer Vision, pages 3754–3762, 2017.
[295] Zhun Zhong, Liang Zheng, Shaozi Li, and Yi Yang. Generalizing a person retrieval
model hetero-and homogeneously. In Proceedings of the European conference on com-
puter vision (ECCV), pages 172–188, 2018.
[296] Zhun Zhong, Liang Zheng, Zhiming Luo, Shaozi Li, and Yi Yang. Invariance matters:
Exemplar memory for domain adaptive person re-identification. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 598–607, 2019.
[297] Denny Zhou, Mao Ye, Chen Chen, Tianjian Meng, Mingxing Tan, Xiaodan Song, Quoc
Le, Qiang Liu, and Dale Schuurmans. Go wide, then narrow: Efficient training of deep
thin networks. In International Conference on Machine Learning, pages 11546–11555.
PMLR, 2020.
[298] Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei.
Bert loses patience: Fast and robust inference with early exit. Advances in Neural
Information Processing Systems, 33:18330–18341, 2020.
[299] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quanti-
zation. In Proceedings of the International Conference on Learning Representations,
pages 1–10, 2017.
[300] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-
image translation using cycle-consistent adversarial networks. In Proceedings of the
IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[301] Shilin Zhu, Xin Dong, and Hao Su. Binary ensemble neural network: More bits per
network or more networks per bit? In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 4923–4932, 2019.
[302] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. Face alignment
across large poses: A 3d solution. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 146–155, 2016.
[303] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Structured
binary neural networks for accurate image classification and semantic segmentation.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pages 413–422, 2019.
[304] Li’an Zhuo, Baochang Zhang, Hanlin Chen, Linlin Yang, Chen Chen, Yanjun Zhu,
and David Doermann. Cp-nas: Child-parent neural architecture search for 1-bit cnns.
In Proceedings of the Twenty-Ninth International Conference on International Joint
Conferences on Artificial Intelligence, pages 1033–1039, 2020.
[305] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning.
arXiv preprint, 2016.
[306] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning trans-
ferable architectures for scalable image recognition. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.
[307] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transfer-
able architectures for scalable image recognition. In Proc. of CVPR, pages 8697–8710,
2018.