

2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS)

XOR-Net: An Efficient Computation Pipeline for Binary Neural Network Inference on Edge Devices

Shien Zhu, Luan H. K. Duong, Weichen Liu


School of Computer Science and Engineering, Nanyang Technological University, Singapore

Email: [email protected], {lhkduong, liu}@ntu.edu.sg

Abstract—Accelerating the inference of Convolution Neural Networks (CNNs) on edge devices is essential due to the small memory size and poor computation capability of these devices. Network quantization methods such as XNOR-Net, Bi-Real-Net, and XNOR-Net++ reduce the memory usage of CNNs by binarizing them. They also simplify the multiplication operations to bit-wise operations and obtain good speedup on edge devices. However, there are hidden redundancies in the computation pipeline of these methods, constraining the speedup of those binarized CNNs.

In this paper, we propose XOR-Net as an optimized computation pipeline for binary networks both without and with scaling factors. As XNOR is realized by two instructions, XOR and NOT, on CPU/GPU platforms, XOR-Net avoids NOT operations by using XOR instead of XNOR, thus reducing the bit-wise operations in both aforementioned kinds of binary convolution layers. For the binary convolution with scaling factors, our XOR-Net further rearranges the computation sequence of calculating and multiplying the scaling factors to reduce full-precision operations. Theoretical analysis shows that XOR-Net reduces one-third of the bit-wise operations compared with traditional binary convolution, and up to 40% of the full-precision operations compared with XNOR-Net. Experimental results show that our XOR-Net binary convolution without scaling factors achieves up to 135× speedup and consumes no more than 0.8% energy compared with parallel full-precision convolution. For the binary convolution with scaling factors, XOR-Net is up to 17% faster and 19% more energy-efficient than XNOR-Net.

Keywords-CNN Acceleration; Neural Network Quantization; Binary Neural Networks; Edge Devices

I. INTRODUCTION

Deep Convolution Neural Networks (CNNs) are both computation and memory intensive. For example, the state-of-the-art CNN EfficientNet-B7 [1] has 66 million parameters and requires 37 billion floating-point operations (FLOPs). Such large CNNs are hard to deploy on edge devices with limited computation resources and small memory. Therefore, acceleration methods including transformation [2], pruning [3], and quantization [4] have been proposed to alleviate this execution problem.

Among the acceleration methods, quantization utilizes low-precision numbers instead of 32/64-bit floating-point numbers to represent the weights and activations, thus reducing the storage cost of CNN inference. Quantization also brings lower latency because low-precision computations are faster than full-precision ones and may reduce the hardware complexity of accelerators. For example, QNNPACK [5] utilizes 8-bit integers and achieves high-performance CNN inference on mobile platforms.

Among these quantization methods, Binary Neural Network [6] quantizes both the filters and the activations to 1-bit numbers and achieves extreme storage saving and speedup. As both the activations and the filters are '0's and '1's, the binary convolution becomes equivalent to XNOR operations followed by pop-count (counting the number of "1"s in a binary integer) operations, resulting in much faster computation than convolution using 32-bit floating-point Multiply-Accumulate (MAC) operations. However, binarization brings a large accuracy loss, so XNOR-Net [7] and XNOR-Net++ [8] add 32-bit full-precision scaling factors to the quantized filters and activations to improve the accuracy of binarized networks.

However, not only these works but also the latest works such as Bi-Real-Net [9] and CI-BCNN [10] adopt XNOR and pop-count operations for binary convolution without considering the hardware implementation. As there is no XNOR instruction on most CPU and GPU platforms [11]–[14], they have to conduct XNOR using two instructions (XOR and NOT) instead of one, which degrades the speedup of binary neural networks. Besides, there exist many hidden redundancies in XNOR-Net when calculating the scaling factors and getting the final output, which are hardly noticed unless one views the combined execution process of consecutive layers. We have observed that these redundant full-precision operations account for 25-40% of the FLOPs in convolution layers and seriously affect the computation efficiency.

In this paper, we propose XOR-Net as an efficient binary network inference method to solve these problems. First, as XOR is a universal instruction in off-the-shelf CPU and GPU platforms [11]–[14], XOR-Net uses XOR instead of XNOR for binary convolution and finishes the bit-wise operation within one cycle instead of two, reducing all the NOT operations compared with traditional methods.

Second, for those binary networks with scaling factors, XOR-Net moves the multiplication of the scaling factor matrix to the next layer and moves constants to the scaling factor of the weights, so XOR-Net further reduces up to 40% of the full-precision operations compared with XNOR-Net. By carefully modifying the computation that follows the bit-wise operations, XOR-Net produces the same convolution results and keeps the same neural network accuracy as traditional binary methods without bringing new overhead.

Our proposed XOR-Net can be implemented on all general-purpose platforms, including CPUs and GPUs, that provide XOR and pop-count instructions. We implement XOR-Net on a RISC-V based edge device, GreenWaves GAP8, taking advantage of reduced branching operations, loop unrolling, and bit-level and filter-level parallelism. We evaluate the actual speedup and energy consumption of XOR-Net on the edge device at the layer level with different configurations of the input size, the input channel, and the filter number. Experimental results show that XOR-Net achieves 81-135× speedup and consumes no more than 0.8% energy compared with the parallel full-precision layers. For binary convolution with scaling factors, XOR-Net is 10-17% faster and 19% more energy-efficient than XNOR-Net. Please note that the speedup and energy efficiency are gained without any accuracy cost.

The rest of this paper is organized as follows. Section II introduces related works and Section III discusses existing binary convolution methods and our observations. Section IV describes our proposed XOR-Net with a theoretical analysis, while Section V details our XOR-Net binary convolution implementation. Section VI provides the experimental results and Section VII concludes our work.

II. RELATED WORKS

Quantization is a popular CNN acceleration method. For example, 16/8-bit quantization has been adopted by the deep learning frameworks TensorFlow Lite and PyTorch, and by hardware accelerators such as Google TPUs and Nvidia GPUs. Among different quantization methods, 1-bit quantization is an extreme case, which has been proposed and studied by many works. BinaryConnect [15] quantizes the weights into {+1, -1} to replace the multiplication operations with additions and subtractions. BNN [6] quantizes both the input activations and the weights into one bit to achieve extreme speedup. XNOR-Net [7] and XNOR-Net++ [8] add scaling factors to the binarized weights and activations to improve the accuracy. Bi-Real-Net [9] adds real-valued activation shortcuts to improve the information representation ability, and CI-BCNN [10] mines channel-wise interactions to reduce the sign error, but these works focus on improving the accuracy without considering the inefficiency of XNOR operations. Meanwhile, BMXNet v1 and v2 [16], [17] implement binary convolution layers including XNOR-Net in MXNet and optimize the GEMM kernels for high speedups, but neither of these works has noticed the computation redundancy in XNOR-Net, nor removed it.

III. MOTIVATION

Existing methods such as BNN, Bi-Real-Net, and CI-BCNN use binary convolution without scaling factors, whose general steps include binarizing the filter weights, binarizing the input activations, and performing the bit-wise convolution. XNOR-Net and XNOR-Net++ use binary convolution with scaling factors, whose basic steps also include calculating the scaling factors of the filters and the activations as well as multiplying the scaling factors into the bit-wise convolution results.

A. BCNN: Binary Convolution without Scaling Factors

The binarization of the input activations X and weights W is usually realized by getting the sign bits and then packing the single sign bits into 32/64-bit integers. So there will be high data parallelism when convolving the quantized activations QX and weights QW to get the layer outputs O.

QX = sign(X)  (1)
QW = sign(W)  (2)
O = QX ∗ QW  (3)

As the quantized activations and weights are all +1 and -1 represented by "0" and "1", the dot product inside the convolution can be finished using XNOR and pop-count. Suppose we need to calculate the dot product of two vectors VQX and VQW taken from the quantized activations and quantized weights respectively, as equations (4)-(7) show. XNOR operations get "1" when the operands are both "0"s or both "1"s, so the summation of the pop-count result sum stands for the number of +1 in the dot product result. N is the total number of bits of the XNOR result, so N − sum is the number of -1 in the dot product result. Finally, the dot product result is obtained as equations (6)-(7) show.

sum = popcnt(VQX XNOR VQW)  (4)
    = popcnt(NOT(VQX XOR VQW))  (5)
VQX · VQW = sum − (N − sum)  (6)
          = 2 × sum − N  (7)

We notice that XNOR is realized using XOR and NOT operations inside CPU and GPU platforms [11]–[14], so it is not efficient to use XNOR for the quantized convolution. We can reduce the number of bit-wise operations in the binary convolution by using XOR instead of XNOR. Thus we propose another convolution scheme called XOR-Net, which uses XOR and pop-count to achieve the same functionality without the NOT operation.
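The difference between the two schemes is easiest to see on a single packed machine word. The following is a minimal, hedged C sketch (ours, not the paper's GAP8 code): it computes the dot product of 32 packed ±1 values once with the XNOR formulation of equations (4)-(7) and once with the XOR formulation; __builtin_popcount() is a GCC/Clang builtin standing in for a hardware pop-count instruction.

#include <assert.h>
#include <stdint.h>

#define N 32  /* bits per packed word */

/* Equations (4)-(7): XNOR needs an extra NOT because there is no
 * XNOR instruction; sum counts the +1 products. */
static int dot_xnor(uint32_t x, uint32_t w) {
    int sum = __builtin_popcount(~(x ^ w));
    return 2 * sum - N;
}

/* XOR-Net: one XOR and one pop-count; sum counts the -1 products. */
static int dot_xor(uint32_t x, uint32_t w) {
    int sum = __builtin_popcount(x ^ w);
    return N - 2 * sum;
}

int main(void) {
    uint32_t x = 0xA5A5F00Du, w = 0x3C3C0FF0u;
    assert(dot_xnor(x, w) == dot_xor(x, w));  /* identical result */
    return 0;
}

Both functions return the same value for any inputs, but dot_xor issues one XOR plus one pop-count where dot_xnor additionally needs the NOT.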

B. XNOR-Net: Binary Convolution with Scaling Factors

1) Calculating the Scaling Factors: In XNOR-Net, the scaling factor α of a filter is the average absolute value of its weights. W ∈ R^(c×kh×kw) is the weight tensor with three dimensions: the channel, the kernel height, and the kernel width. n = c × kh × kw is the total number of weights in a filter. The scaling factors of the filters are calculated during training, so they have no computation cost in inference.

α = (1/n) · ||W||_ℓ1  (8)

As equations (9)-(10) show, X ∈ R^(c×h×w) is the 32-bit full-precision input tensor with three dimensions: channel, height, and width. The original XNOR-Net performs the following steps to get the scaling factor matrix of the input activations. First, calculate the average absolute value matrix A ∈ R^(h×w) of the input tensor X across the channel. Second, do a fake convolution between A and I ∈ {1}^(kh×kw), a matrix full of ones with the same size as the kernels. The fake convolution adds up the corresponding elements of A and gets the scaling factor matrix K ∈ R^(oh×ow) of the activations.

A = (1/c) · Σ_{j=1..c} |X_{j,:,:}|  (9)
K = (1/(kh·kw)) · A ∗ I  (10)

However, since the numbers of the input channel, the kernel height, and the kernel width are known before getting the input activations, there is no need to multiply the constants 1/c and 1/(kh·kw) when calculating each scaling factor matrix of the activations. We can move these two constants to the scaling factors of the weights, which are calculated during training, to reduce the computation cost of inference.

2) Multiplying the Scaling Factors: In the last step of XNOR-Net, the bit-wise convolution result needs to be multiplied by the scaling factors mentioned in the previous subsection. The bit-wise convolution result O ∈ N^(oh×ow) is multiplied by the scaling factor matrix K of the input activations in an element-wise manner (denoted ⊙), then by the scaling factor α of the filter weights. After that, we get Y ∈ R^(oh×ow), the final output of the convolution with one filter.

O = QX ∗ QW  (11)
Y = O ⊙ K · α  (12)

The original XNOR-Net produces the final output by multiplying the same scaling factor matrix K with the bit-wise convolution result O of every filter. As Fig. 1(a) shows in the bracket, this leads to many repeated multiplications at the matrix level. Viewing the whole computation pipeline across layers in Fig. 1(a), we observe that there is an Avg. function (which gets the average absolute values across the channel) in the next layer. Therefore, we can move the element-wise multiplication by K outside the average function to remove the unnecessary calculations, as Fig. 1(b) shows.

Figure 1. Illustration of the computation pipeline of binary convolution with scaling factors in consecutive layers. (a) The original XNOR-Net algorithm. (b) The proposed optimized XOR-Net algorithm.

IV. PROPOSED BINARY CONVOLUTION

We propose XOR-Net in this paper to avoid using the NOT operations repeatedly. Our XOR-Net utilizes XOR, a universal instruction in off-the-shelf CPU and GPU platforms, and pop-count as the main operations in the bit-wise convolution. Because both the binary convolution layers with and without scaling factors need to conduct bit-wise convolution, XOR-Net is applicable to both of them. In particular, for the binary convolution with scaling factors like XNOR-Net, XOR-Net rearranges the computation pipeline based on the motivations mentioned in the previous section and further reduces the number of full-precision operations dealing with scaling factors.

A. XOR-Net: Binary Convolution without Scaling Factors

XOR-Net keeps the same computation sequence as equations (1)-(3) but changes the dot product inside the bit-wise convolution according to equations (13)-(15). As the vectors VQX and VQW of quantized activations and quantized weights are all +1 and -1 represented by "0" and "1", XOR operations get "1" when the operands include one "0" and one "1", so the summation of the pop-count result sum stands for the number of -1 in the dot product result. Similarly, N is the total number of bits of the XOR result, so N − sum is the number of +1 in the dot product result. Therefore, we obtain the dot product result using only one XOR and one pop-count. Compared with existing methods, XOR-Net saves one NOT operation for all the quantized activations.

sum = popcnt(VQX XOR VQW)  (13)
VQX · VQW = (N − sum) − sum  (14)
          = N − 2 × sum  (15)

B. XOR-Net-S: Binary Convolution with Scaling Factors

For binary convolution with scaling factors, our XOR-Net moves the multiplication of some constants to the scaling factor of the weights based on the motivation in Section III-B. XOR-Net also changes the way the output is produced in consecutive convolution layers, as Fig. 1(b) shows.

α' = (1/(kh·kw)) · (1/c) · α = (1/(c·kh·kw·n)) · ||W||_ℓ1  (16)

Equation (16) shows the calculation of the new scaling factor α' of the filter weights. Usually, α' is determined during training, so we do not need to calculate it any more during inference. Our method can also be applied to pre-trained original XNOR-Net models by calculating α' from the original scaling factor α, which is only a one-time cost.
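As a concrete illustration of equation (16), converting a pre-trained XNOR-Net scale α into α' is a single division per filter that can be performed once, offline. The C sketch below is ours, not code from the paper; the flat array layout and the function name are assumptions.

#include <stddef.h>

/* One-time, offline folding of the constants 1/c and 1/(kh*kw) into the
 * per-filter scaling factors of a pre-trained XNOR-Net model (eq. (16)):
 * alpha'[f] = alpha[f] / (c * kh * kw). */
static void fold_filter_scales(float *alpha, size_t num_filters,
                               int c, int kh, int kw) {
    const float inv = 1.0f / (float)(c * kh * kw);
    for (size_t f = 0; f < num_filters; ++f)
        alpha[f] *= inv;
}

For example, a 3 × 3 kernel over 256 input channels gives α' = α / 2304; at inference time the activation side then only needs the summations of equations (17)-(18) below.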

A = Σ_{j=1..c} |X_{j,:,:}|  (17)
K^(i) = A ∗ I (the first layer);  K^(i) = (A ⊙ K^(i−1)) ∗ I (otherwise)  (18)
O = sign(X) ∗ sign(W)  (19)
Y = X ∗ W ≈ O · α' (not the last layer);  Y ≈ O ⊙ K^(i) · α' (otherwise)  (20)

Equations (17)-(20) show the calculation of the scaling factor matrix of the activations and the generation of the layer output. Our method follows the same basic logic as the original XNOR-Net algorithm and optimizes the computation pipeline across layers in convolution blocks. For the first convolution layer using our method, the average absolute value matrix A does the fake convolution to get the scaling factor matrix K. Then the intermediate result O does not multiply the scaling factor matrix K, but conveys it to the next layer. If this is a consecutive layer after the first one, A multiplies the scaling factor matrix of the previous layer K^(i−1) to get the scaling factor matrix of this layer K^(i), and K^(i) is conveyed to the next layer as well. Only when it is the convolution layer before a pooling layer or at the end of the convolution block does the intermediate result O multiply the scaling factor matrix K^(i) to get the layer output.

C. Theoretical Operation Reduction

XOR-Net uses XOR and pop-count instead of XOR, NOT and pop-count for the bit-wise convolution to remove the computation redundancy introduced by XNOR. So XOR-Net reduces one-third of the bit-wise operations in binary convolution layers both with and without scaling factors. As for the full-precision operation reduction in binary convolution with scaling factors, we take VGG-16 and YOLOv2 as example CNNs to compare with the original XNOR-Net. Our XOR-Net-S is also applicable to more complicated CNNs such as inception and residual blocks, as long as we multiply the scaling factor matrix in the convolution layers before pooling, concatenation and addition layers/nodes. Table I lists the convolution layers of the example CNNs in blocks according to the pooling layers in between.

Compared with the original XNOR-Net, XOR-Net achieves up to 40% full-precision operation reduction in convolution layers with 3 × 3 kernels and about 25% full-precision operation reduction in those with 1 × 1 kernels. The convolution layers before pooling layers (such as layers 4, 7, 10 and 13 in VGG-16 and layers 5, 8 and 13 in YOLOv2) have to multiply the scaling factor matrix to produce the exact final output, so they have almost the same number of full-precision MACs as the original XNOR-Net.

Table I
FULL-PRECISION MAC OPERATION REDUCTION RATIO OF THE PROPOSED XOR-NET ALGORITHM

VGG-16      Input size      Filter size      Ratio
Layer 3     64,112,112      128,64,3,3       39.39%
Layer 4     128,112,112     128,128,3,3      0.25%
Layer 5     128,56,56       256,128,3,3      39.69%
Layer 6     256,56,56       256,256,3,3      33.03%
Layer 7     256,56,56       256,256,3,3      0.13%
Layer 8     256,28,28       512,256,3,3      39.84%
Layer 9     512,28,28       512,512,3,3      33.18%
Layer 10    512,28,28       512,512,3,3      0.06%
Layer 11    512,14,14       512,512,3,3      33.25%
Layer 12    512,14,14       512,512,3,3      33.18%
Layer 13    512,14,14       512,512,3,3      0.06%

YOLOv2      Input size      Filter size      Ratio
Layer 3     64,56,56        128,64,3,3       39.39%
Layer 4     128,56,56       64,128,1,1       25.19%
Layer 5     64,56,56        128,64,3,3       0.30%
Layer 6     128,28,28       256,128,3,3      39.69%
Layer 7     256,28,28       128,256,1,1      25.10%
Layer 8     128,28,28       256,128,3,3      0.15%
Layer 9     256,14,14       512,256,3,3      39.84%
Layer 10    512,14,14       256,512,1,1      25.05%
Layer 11    256,14,14       512,256,3,3      39.77%
Layer 12    512,14,14       256,512,1,1      25.05%
Layer 13    256,14,14       512,256,3,3      0.08%

D. Accuracy and Limitation

Our method maintains the same accuracy as the original methods like BNN, Bi-Real-Net and XNOR-Net. XOR-Net is an efficient computation pipeline for binary network inference, and it provides the same output as shown in the mathematical equations, so XOR-Net keeps the same accuracy as the training schemes. If binary networks achieve higher accuracy with new training methods, doing inference using XOR-Net will achieve the same high accuracy.

The limitation of XOR-Net lies in the binary convolution with scaling factors. Changing the computation sequence of XNOR-Net involves conveying the scaling factor matrix to the next convolution layer, which means that the layer outputs of the first and consecutive layers are not the same final results as in the original XNOR-Net. This does not affect activation layers such as ReLU and Leaky ReLU: because the scaling factor matrix only consists of positive numbers, the ReLU output only relies on the convolution result, which contains both negative and positive numbers. However, we have to multiply the scaling factor matrix, and thus cannot get a high speedup compared with XNOR-Net, in those convolution layers before pooling layers, including Max Pooling and Average Pooling (e.g. VGG-16 layer 4), because the input shape is different after pooling and the scaling factor matrix would not fit into the next convolution layer.
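Before moving to the implementation, the control flow implied by equations (17)-(20) and Fig. 1(b) can be sketched for a single filter. The C fragment below is a hedged toy (fixed 4 × 4 input, 3 × 3 kernel, made-up data, our own names); only the placement of the scaling-factor multiplications follows the paper.

#include <stdio.h>

#define H  4              /* input height/width of the toy layer */
#define KH 3              /* kernel height/width                 */
#define OH (H - KH + 1)

/* Equation (18): "fake convolution" of A with an all-ones kernel. */
static void fake_conv(float A[H][H], float K[OH][OH]) {
    for (int i = 0; i < OH; ++i)
        for (int j = 0; j < OH; ++j) {
            float s = 0.0f;
            for (int u = 0; u < KH; ++u)
                for (int v = 0; v < KH; ++v)
                    s += A[i + u][j + v];
            K[i][j] = s;
        }
}

int main(void) {
    float A[H][H];              /* channel-wise absolute sums, eq. (17)      */
    float Kprev[H][H];          /* K^(i-1) conveyed by the previous layer    */
    float K[OH][OH];            /* K^(i) to convey to the next layer         */
    int   O[OH][OH];            /* bit-wise convolution result, eq. (19)     */
    float alpha_prime = 0.01f;  /* folded filter scale, eq. (16)             */
    int   end_of_block = 1;     /* e.g. a pooling layer follows              */

    /* toy data */
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < H; ++j) { A[i][j] = 1.0f; Kprev[i][j] = 2.0f; }
    for (int i = 0; i < OH; ++i)
        for (int j = 0; j < OH; ++j) O[i][j] = 5;

    /* Consecutive layer: fold K^(i-1) into A before the fake convolution. */
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < H; ++j) A[i][j] *= Kprev[i][j];
    fake_conv(A, K);            /* K^(i), eq. (18) */

    /* Equation (20): multiply K^(i) only at the end of a convolution block. */
    for (int i = 0; i < OH; ++i)
        for (int j = 0; j < OH; ++j) {
            float y = end_of_block ? O[i][j] * alpha_prime * K[i][j]
                                   : O[i][j] * alpha_prime;
            printf("%6.2f ", y);
        }
    printf("\n");
    return 0;
}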

V. IMPLEMENTATION ON AN EDGE DEVICE

We implement the XOR-Net convolution layers on GreenWaves GAP8, a RISC-V based ultra-low-power edge processor shown in Fig. 2. We take the convolution benchmark suites [18] developed by GreenWaves Technologies as reference implementations to optimize our benchmark codes and ensure a fair comparison. We utilize the APIs in the GAP SDK [19] for the bit-insert, XOR, and pop-count operations. Packing the sign bits of the activations during quantization needs bit-insert, while the bit-wise convolution uses XOR and pop-count as its main operations. Our proposed implementation can be deployed to other platforms by simply replacing the bit-insert and pop-count instructions.

Figure 2. GreenWaves GAPuino development board. Source: [20]

As the binary convolution with scaling factors, XOR-Net-S, has two more steps than XOR-Net without scaling factors, we only introduce XOR-Net-S for simplicity. We implement the data preparation part of the proposed XOR-Net-S following Algorithm 1 and the bit-wise convolution part following Algorithm 2. Excluding the statements concerning scaling factors, these algorithms become binary convolution without scaling factors.

A. Data Preparation

Data preparation in XOR-Net-S includes packing the sign bits of the input tensor and calculating the scaling factor matrix of the input. First, we get the sign bits of the activations and pack them into 32/64-bit integers for bit-level parallelism. For example, after packing the sign bits into 64-bit integers, one XOR operation on one packed integer equals one operation on 64 original input elements. Therefore, packing the sign bits into integers brings high bit-level parallelism.

We pack the sign bits across the input channel following BitFlow [21], which keeps the logical input shape as CHW after the packing. The packing of sign bits is realized by the bit-insert instruction, i.e. inserting the first bit of X_{32pc+i,h,w} into the pack at offset i. In the IEEE standard format, the sign bit of both int and float numbers is the first bit. We avoid if() statements or sign() functions in the loop by packing the first bit directly into the container, because branches degrade the packing speed.

Second, we decouple the calculation of the scaling factors into two parts. The first part, getting the average matrix of the input, is combined with the packing of the sign bits, and the second part, the fake convolution, is a standalone for() loop. Therefore, our implementation saves a loop that goes through all the data, bringing better data locality and coding efficiency.

B. Bit-wise Convolution

We use XOR and pop-count operations to get the bit-wise convolution result O. In the previous section, we have proved that mathematically the XOR-Net binary convolution results are the same as the binary convolution results that use XNOR and pop-count. Because XOR is supported by the instruction sets of CPU and GPU platforms, while XNOR needs one more NOT operation on the XOR result, our implementation saves one NOT operation compared with traditional binary convolution.

Algorithm 1 Input binarization and calculating scaling factors
Input: input tensor X, scaling factor matrix from the previous layer K^(i−1)
Output: input sign tensor S, scaling factor matrix K^(i)
 1: // Get the sign bits and the average matrix of X
 2: packed_channel = input_channel / 32
 3: for each input height h do
 4:   for each input width w do
 5:     sum = 0
 6:     for each packed channel pc do
 7:       // Use a 32-bit integer to contain the sign bits
 8:       pack = 0
 9:       for i from 0 to 31 do
10:         // Insert the sign bit into the pack
11:         bitinsert(pack, X_{32pc+i,h,w}, 1, i)
12:         // Sum up the absolute values of the input
13:         sum += |X_{32pc+i,h,w}|
14:       end for
15:       S_{h,w,pc} = pack
16:     end for
17:     A_{h,w} = sum                      // the first layer
18:     A_{h,w} = sum × K^(i−1)_{h,w}      // otherwise
19:   end for
20: end for
21: // Get the scaling factor matrix of the input tensor
22: for each output height h do
23:   for each output width w do
24:     sum = 0
25:     sum += A_{h+0,w+0}
26:     ...
27:     sum += A_{h+kh,w+kw}
28:     K^(i)_{h,w} = sum
29:   end for
30: end for
31: return S, K^(i)
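Algorithm 1 relies on the GAP8 bit-insert intrinsic. As a hedged, portable C sketch of the same branch-free idea (our own function and layout, not the paper's code), the IEEE-754 sign bit can be read directly from the float representation and shifted into the packed word; fabsf() replaces the explicit absolute-value accumulation of line 13.

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Pack the sign bits of 32 consecutive floats (one packed channel group)
 * into a 32-bit word and accumulate their absolute values, mirroring
 * lines 6-15 of Algorithm 1. A negative input (-1 after binarization)
 * sets its bit, matching the paper's "+1 as 0, -1 as 1" encoding. */
static uint32_t pack_signs_and_abs(const float *x, float *abs_sum) {
    uint32_t pack = 0;
    float sum = 0.0f;
    for (int i = 0; i < 32; ++i) {
        uint32_t bits;
        memcpy(&bits, &x[i], sizeof bits);  /* reinterpret the float bits   */
        pack |= (bits >> 31) << i;          /* sign bit, no branch or sign() */
        sum  += fabsf(x[i]);                /* running |x| for eq. (17)      */
    }
    *abs_sum += sum;
    return pack;
}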

Algorithm 2 XOR-Net bit-wise convolution
Input: input sign tensor S, weight sign tensor W and scaling factors α', input scaling factor matrix from the previous layer K^(i−1)
Output: convolution result Y, scaling factor matrix K^(i)
 1: N = c × kh × kw
 2: for each filter f do
 3:   for each output height h do
 4:     for each output width w do
 5:       R = 0
 6:       for each packed channel c do
 7:         R += cnt(S_{c,h+0,w+0} ⊕ W_{c,0,0})
 8:         ...
 9:         R += cnt(S_{c,h+kh,w+kw} ⊕ W_{c,kh,kw})
10:       end for
11:       O_{f,h,w} = N − 2 × R
12:       Y_{f,h,w} = O_{f,h,w} × α'_f                    // not the last layer
13:       Y_{f,h,w} = O_{f,h,w} × α'_f × K^(i)_{h,w}      // otherwise
14:     end for
15:   end for
16: end for
17: return Y
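For reference, below is a hedged C sketch of the per-pixel work in lines 5-11 of Algorithm 2. The flat [height][width][packed_channel] indexing and the function name are our assumptions, and __builtin_popcount() again stands in for the SDK pop-count builtin; the paper's GAP8 kernel additionally unrolls the kernel window and parallelizes the filter loop across the eight cluster cores.

#include <stdint.h>

/* Dot product of one kernel window with the packed input, following
 * lines 5-11 of Algorithm 2. S and W hold packed sign bits laid out as
 * [height][width][packed_channel]; pc is the number of 32-bit words per
 * pixel. N = c * kh * kw is the number of original 1-bit elements. */
static int xor_conv_pixel(const uint32_t *S, const uint32_t *W,
                          int in_w, int pc, int kh, int kw,
                          int h0, int w0, int N) {
    int r = 0;
    for (int u = 0; u < kh; ++u)
        for (int v = 0; v < kw; ++v)
            for (int c = 0; c < pc; ++c) {
                uint32_t s = S[((h0 + u) * in_w + (w0 + v)) * pc + c];
                uint32_t w = W[(u * kw + v) * pc + c];
                r += __builtin_popcount(s ^ w);   /* XOR + pop-count */
            }
    return N - 2 * r;   /* line 11 of Algorithm 2: O = N - 2R */
}

Calling this for every (f, h, w) and then applying line 12 or 13 reproduces the algorithm's outputs O and Y.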
Finally, we multiply the bit-wise convolution result O with the scaling factors of the activations and the weights according to equation (20). Moreover, we utilize filter-level parallelism when implementing the bit-wise convolution. Since there are eight cores on GAP8, the filter number is usually a multiple of 8, and there is no data dependency between the filters, it is efficient to run the convolution in parallel across different filters. As for the binarization and scaling factor calculation, we perform the parallel processing across the input height/width/channel wherever possible.

VI. EVALUATION

We evaluate full-precision convolution, XNOR-Net and XOR-Net-S (binary convolution with scaling factors), and BCNN and XOR-Net (binary convolution without scaling factors) at the layer level with different configurations of the input size, the input channel and the filter number. We employ a Ubuntu 16.04 based workstation as the host machine, record the execution time with the hardware timer on GAP8, and record the power consumption through a USB power meter UM25C. Each test case is executed at least 10 times to get the average execution time.

A. Increasing the Input Size

We increase the input size (C-H-W) of the convolution layers by a factor of two accordingly. There are 32 filters with 3 × 3 kernels in this experiment. Fig. 3 shows the execution time of all the evaluated layers. The latency of the binary convolution layers is much smaller than the latency of full-precision convolution. XOR-Net-S runs faster than XNOR-Net, and XOR-Net has less latency than BCNN as well.

Figure 3. The execution time of different convolution layers when increasing the input size.

Figure 4. The speedup compared with parallel full-precision convolution when increasing the input size.

The speedup compared with parallel full-precision convolution is shown in Fig. 4. The original XNOR-Net achieves 52-73× speedup while the optimized XOR-Net-S achieves 59-83× speedup. Compared with the original XNOR-Net, our method XOR-Net-S has a 10-17% speedup. Because there is no calculation dealing with scaling factors, BCNN achieves 95-114× speedup and XOR-Net has 98-119× speedup compared with parallel full-precision convolution. XOR-Net is 3-5% faster than BCNN because we have removed a NOT operation from the binary convolution pipeline. As NOT is a simple bit-wise operation that finishes within one clock cycle, and the operands of the NOT operation are already in registers after the XOR operation, the overall performance gain of removing the NOT operation is moderate.

We notice that the speedup is relatively higher when the input channel is 64, so we explore the performance of XOR-Net by increasing the input channel in the next experiment.

B. Increasing the Input Channel

As observed in the previous section, we perform this experiment to find out the relationship between the speedup of XOR-Net and the input channel. We increase the input channel linearly from 32 to 192, with the input size set to C × 14 × 14. There are still 32 filters with 3 × 3 kernels. The speedup compared with parallel full-precision convolution layers with the increasing input channel is shown in Fig. 5.

Figure 5. The speedup compared with parallel full-precision convolution when increasing the input channel.

Figure 6. The speedup compared with parallel full-precision convolution when increasing the filter number.

The original XNOR-Net and XOR-Net-S have 58-89× and 68-95× speedup compared with full-precision convolution respectively, while BCNN and XOR-Net have 103-122× and 107-129× speedup respectively. When we increase the number of input channels linearly, the speedup first goes up and then becomes steady. The reason behind this phenomenon is that when increasing the input channel, we increase both the input depth and the quantization overhead. Increasing the input depth brings better data locality, but the data locality benefit does not grow much once the input depth is large. Moreover, increasing the input channel also increases the binarization and bit-packing workload. As the data preparation overhead increases with the input channel, the speedup cannot grow linearly with the input channel.

C. Increasing the Filter Number

The data preparation overhead in the binary convolution may affect the speedup. We keep the data preparation computation constant by increasing only the filter number to verify this idea. The input size is 64 × 14 × 14, and we still use the popular 3 × 3 kernels in this experiment. Though popular networks seldom use a small number of filters in a convolution layer, we include small filter numbers for experimental purposes. As Fig. 6 shows, we obtain the highest speedup in all the experiments when the filter number is largest: XNOR-Net and XOR-Net-S have 95× and 109× speedup respectively, and BCNN and XOR-Net have 128× and 135× speedup respectively.

As Fig. 6 shows, the speedup grows with the increment of the filter number. The bit-wise convolution accounts for more of the time as the filter number increases, and the speedup increases with the bit-wise convolution ratio. This observation is consistent with the basic logic of binary convolution: using much faster XOR/XNOR and pop-count operations instead of full-precision multiplication and accumulation operations to accelerate CNNs. Therefore, we can get a higher speedup by reducing the data preparation overhead or increasing the bit-wise convolution ratio in the convolution layer.

D. Power Consumption Analysis

We report the energy efficiency of XOR-Net and the other binary convolution methods in Table II. The input size for the convolution layer is 64 × 14 × 14 and there are 128 filters with 3 × 3 kernels. The voltage resolution of our power meter UM25C is 0.001 V (error: 0.05%) and the current resolution is 0.0001 A (error: 0.1%) [22], which means that we can measure the power to 0.0001 mW. We report the power to 0.01 mW for reliability. We execute the full-precision convolution layer 20 times and the XNOR-Net layers 2500 times to make sure there are enough power measurement points, then calculate the average value of a single inference. The speedups in Table II are slightly different from those reported in the previous subsections because this is another run.

Binary convolution layers consume no more than 1% of the energy of full-precision convolution. Our proposed method XOR-Net-S achieves 125× energy efficiency compared with parallel full-precision convolution, and saves 16% energy compared with the original XNOR-Net. XOR-Net binary convolution without scaling factors achieves 159× energy efficiency compared with full-precision convolution and is 10% more energy-efficient than BCNN. Though the speed benefit of XOR-Net compared with BCNN is moderate, the energy efficiency of XOR-Net is good due to the lower arithmetic density.

Table II
EXECUTION TIME AND ENERGY CONSUMPTION OF FULL-PRECISION AND BINARY CONVOLUTION LAYERS IN A GIVEN CONFIGURATION.

Conv Layer Type Power(mW) Time(ms) Speedup Energy(mJ) Energy Ratio


Full-Precision 430.03 1074.83 1.0× 462.21 100.00%
XNOR-Net 395.76 11.14 96.5× 4.41 0.95%
XOR-Net-S 380.22 9.76 110.2× 3.71 0.80%
BCNN 384.93 8.28 129.7× 3.19 0.69%
XOR-Net 369.27 7.88 136.4× 2.91 0.63%
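The Energy column of Table II is the product of the measured power and execution time, e.g. 430.03 mW × 1074.83 ms ≈ 462.21 mJ for the full-precision layer and 369.27 mW × 7.88 ms ≈ 2.91 mJ for XOR-Net; the Energy Ratio normalizes each entry to the full-precision 462.21 mJ.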

VII. CONCLUSION

We proposed XOR-Net, an optimized computation pipeline for binary network inference on edge devices. For the binary convolution without scaling factors, XOR-Net uses XOR instead of XNOR in the bit-wise convolution to save a NOT operation. For the binary convolution with scaling factors, XOR-Net-S further reduces the redundant full-precision operations. Viewing the whole computation pipeline across consecutive convolution layers, XOR-Net-S moves two constants to the scaling factor of the weights and multiplies the scaling factor matrix of the activations in the next layer wherever possible. Theoretical analysis shows that XOR-Net reduces one-third of the bit-wise operations compared with BCNN, and XOR-Net-S further reduces up to 40% of the full-precision operations compared with XNOR-Net while keeping the same accuracy.

We implemented XOR-Net on an edge device with bit-level and filter-level parallelism. The experimental results show that our optimized XOR-Net achieves 81-135× speedup and about 159× energy efficiency compared with full-precision layers, and 3%-5% speedup and about 10% higher energy efficiency compared with traditional BCNN. The optimized binary convolution with scaling factors, XOR-Net-S, is 10-17% faster and 19% more energy-efficient than the original XNOR-Net. Exploring the performance by increasing the input channel and the filter number, we observe that XOR-Net can achieve higher speedup with more input channels and filters in the convolution layers.

ACKNOWLEDGMENT

This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and partially supported by Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087).

REFERENCES

[1] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning, 2019, pp. 6105–6114.
[2] A. Xygkis, D. Soudris, L. Papadopoulos, S. Yous, and D. Moloney, "Efficient Winograd-based convolution kernel implementation on edge devices," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, 2018, pp. 1–6.
[3] Z. You, K. Yan, J. Ye, M. Ma, and P. Wang, "Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2019, pp. 2130–2141.
[4] Y. Guo, "A survey on methods and theories of quantized neural networks," arXiv preprint arXiv:1808.04752, 2018.
[5] M. Dukhan, Y. Wu, H. Lu, and B. Maher, "QNNPACK." [Online]. Available: https://github.com/pytorch/QNNPACK
[6] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016, pp. 4107–4115.
[7] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[8] A. Bulat and G. Tzimiropoulos, "XNOR-Net++: Improved binary neural networks," in The British Machine Vision Conference (BMVC), 2019.
[9] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, "Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm," in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[10] Z. Wang, J. Lu, C. Tao, J. Zhou, and Q. Tian, "Learning channel-wise interactions for binary convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 568–577.
[11] Arm Limited, "Arm Instruction Set Reference Guide." [Online]. Available: static.docs.arm.com/100076/0100/arm_instruction_set_reference_guide_100076_0100_00_en.pdf
[12] Intel Corporation, "Intel 64 and IA-32 Architectures Software Developer's Manual." [Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf
[13] RISC-V Foundation, "The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 2.2." [Online]. Available: https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf
[14] NVIDIA Corporation, "Parallel Thread Execution ISA Version 7.0," 2020. [Online]. Available: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
[15] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
[16] H. Yang, M. Fritzsche, C. Bartz, and C. Meinel, "BMXNet: An open-source binary neural network implementation based on MXNet," in Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1209–1212.
[17] J. Bethge, M. Bornstein, A. Loy, H. Yang, and C. Meinel, "Training competitive binary neural networks from scratch," ArXiv e-prints, 2018.
[18] GreenWaves Technologies, "GAP8 Benchmark Test Suite," Dec. 2019. [Online]. Available: https://greenwaves-technologies.com/manuals/BUILD/BENCHMARKS/html/index.html
[19] GreenWaves Technologies, "SDK Manuals," 2019. [Online]. Available: https://greenwaves-technologies.com/sdk-manuals/
[20] GreenWaves Technologies, "Gapuino development board," 2020. [Online]. Available: https://greenwaves-technologies.com/product/gapuino/
[21] Y. Hu, J. Zhai, D. Li, Y. Gong, Y. Zhu, W. Liu, L. Su, and J. Jin, "BitFlow: Exploiting vector parallelism for binary neural networks on CPU," in 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2018, pp. 244–253.
[22] RuiDeng Technologies, "UM25C USB tester meter Instructions," Jul. 2020. [Online]. Available: https://phuketshopper.com/software/UM25C/UM25C%20USB%20tester%20meter%20Instructions.pdf

