0% found this document useful (0 votes)
23 views15 pages

BNN in FPGA

.

Uploaded by

astecisgood
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
23 views15 pages

BNN in FPGA

.

Uploaded by

astecisgood
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

JID: NEUCOM

ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

Neurocomputing 0 0 0 (2017) 1–15

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

FP-BNN: Binarized neural network on FPGA


Shuang Liang a, Shouyi Yin a,∗, Leibo Liu a, Wayne Luk b, Shaojun Wei a
a
Institute of Microelectronics, Tsinghua University, Beijing, China
b
Department of Computing, Imperial College London, UK

a r t i c l e i n f o a b s t r a c t

Article history: Deep neural networks (DNNs) have attracted significant attention for their excellent accuracy especially
Received 10 December 2016 in areas such as computer vision and artificial intelligence. To enhance their performance, technologies
Revised 10 August 2017
for their hardware acceleration are being studied. FPGA technology is a promising choice for hardware ac-
Accepted 17 September 2017
celeration, given its low power consumption and high flexibility which makes it suitable particularly for
Available online xxx
embedded systems. However, complex DNN models may need more computing and memory resources
Communicated by Dr. Deng Cheng than those available in many current FPGAs. This paper presents FP-BNN, a binarized neural network
(BNN) for FPGAs, which drastically cuts down the hardware consumption while maintaining acceptable
Keywords:
Binarized neural network accuracy. We introduce a Resource-Aware Model Analysis (RAMA) method, and remove the bottleneck in-
Hardware accelerator volving multipliers by bit-level XNOR and shifting operations, and the bottleneck of parameter access by
FPGA data quantization and optimized on-chip storage. We evaluate the FP-BNN accelerator designs for MNIST
multi-layer perceptrons (MLP), Cifar-10 ConvNet, and AlexNet on a Stratix-V FPGA system. An inference
performance of Tera opartions per second with acceptable accuracy loss is obtained, which shows im-
provement in speed and energy efficiency over other computing platforms.
© 2017 Elsevier B.V. All rights reserved.

1. Introduction quentially, which leads to low efficiency. Graphics processing units


(GPUs) can offer Giga to Tera FLOPs per second’s (FLOP/s) com-
As the computational ability of processors rapidly grows, train- puting speed due to their single-instruction-multiple-data (SIMD)
ing and testing deep neural networks (NNs) become much more architecture and high clock frequency. Therefore, researchers tend
feasible, which substantially boost the design of various models to use one or several GPUs to meet the model training demand
targeting applications such as computer vision [1–3], speech recog- [11] for quick development iterations. However, GPUs also suffer
nition [4,5], and even artificial intelligence (AI) for games against from a high energy cost – for a NVIDIA Tesla K40 GPU, the thermal
human beings [6,7]. Higher accuracy typically demands more com- design power (TDP) is 235 W [12]. Such power consumption can be
plex models. Take ImageNet Large-Scale Vision Recognition Chal- tolerable for high-performance servers, but for embedded systems
lenge (ILSVRC) as example, Krizhevsky et al. [8] achieved 84.7% such as mobile devices, robots, etc., which are mostly powered by
top-5 accuracy in classification task in 2012 with a model including batteries, low power consumption becomes essential.
5 convolution (CONV) layers and 3 fully-connected (FC) layers; He Field Programmable Gate Arrays (FPGAs) usually consume one
et al. [9] got a 95.1% result surpassing human-level classification order-of-magnitude less power than GPUs, while offering consid-
performance (94.9% [3]) with a 22-layer model, and they won the erable speed-up over CPUs. Moreover, FPGAs offer more flexibil-
2015 competition for achieving an accuracy of 96.4% with a model ity, since they are reconfigurable and support customizable data
depth of 152 [10]. Such model can take over 11.3 billion floating- types, which can be useful in reducing resource utilization. There
point operations (GFLOPs) for the inference procedure, and even is much research on accelerating state-of-the-art NN models with
more for training. FPGAs [13–15]. However, since most current FPGAs have limited
These convolutional neural networks (CNNs) mostly consist resources (several dozen M bits of on-chip memory, several hun-
of intensive multiplication and accumulation (MAC) operations. dred to thousand digital signal processors (DSPs)), designers have
General-purpose processors execute these operations mostly se- to adopt techniques such as tiling to support many NN models,
since most models have a large number of weights and MAC oper-

ations (Table 1). Furthermore, memory bandwidth can be a bottle-
Corresponding author.
E-mail addresses: [email protected] (S. Liang),
neck during the data loading stage for some wide data-dependency
[email protected] (S. Yin), [email protected] (L. Liu), [email protected] pattern such as FC layers [15].
(W. Luk), [email protected] (S. Wei).

https://doi.org/10.1016/j.neucom.2017.09.046
0925-2312/© 2017 Elsevier B.V. All rights reserved.

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

2 S. Liang et al. / Neurocomputing 000 (2017) 1–15

Table 1 the input feature-map (fmap) I in a sliding-window manner with


Summary of weight and MAC number of popular CNNs [21].
a stride of S. This can be expressed as:
Model LeNet-5 AlexNet VGG-16 GoogLeNet v1 ResNet-50 (l )
Nin
(l ) (l )

Weights 60 K 61 M 138 M 7M 25.5 M An (i, j ) = B (n ) + W(l ) (m, n )  Im
(l )
(i, j ), (1)
MACs 341 K 724 M 15.5 G 1.43 G 3.9 G
m=1

where  is defined as convolution, which equals to K2 element-


wise multiplications with accumulation (K stands for the kernel
To improve resource usage, there are several ways of compress-
size):
ing models to smaller sizes, such as gaining sparsity of network
connections and narrowing data bit-width [15–17]. Binarization is 
K 
K

a promising method to compress the NN models, which can di- XY= X(i, j ) · Y(i, j ) (2)
rectly shrink the bit-width of inputs and weights from 32 bit i=1 j=1
(single-precision floating-point) to a single bit. Recently, Cour- FC layer: The FC layer will operate a linear transformation on
bariaux et al. [18] introduced a method to train binarized neu- the input 1-D vectors with a weight matrix. The pattern of the
ral networks (BNNs) over MNIST, Cifar-10 and SVHN [19] datasets, input-output network is fully-connected, which is how it got its
with near state-of-the-art accuracy. Shortly after that, Rastegari name. This process can be shown as:
et al. [20] announced they successfully trained ImageNet models
(l )
Nin
with BNN-based XNOR-Net method with an accuracy of 12.4% be-
(l )
 (l )
low the full precision AlexNet, and provides a 58 times speed- A (n ) = B (n ) + I(l ) (m ) · W(l ) (m, n ) (3)
up and 32 times model size compression. The emergence of bi- m=1

narized models makes it feasible to implement a system on FP- POOL layer: The POOL layer realizes a “down-sampling” opera-
GAs with much higher performance than floating-point versions. tion, which compresses the input images into smaller scales. We
This motivates us to design a method to take a given BNN model take the most common max-POOL as an example, which extracts
and generate the datapath logic and data management pattern on the maximum value from the K × K kernel window as the output:
FPGA based to an optimization metric, which forms an accelera-
tor system targeting Tera operations per second’s(TOP/s) through- A(l ) (i, j ) = max[IK(l×K
)
(i, j ) ] (4)
put speed.
In this paper, we introduce FP-BNN, a BNN acceleration system Activation Layer: Just like biological neurons, we say they are
design on FPGA, with related optimizations. The contributions of “firing” once the key value exceeds the threshold and are “silent”
this paper are as follows: if not. Various activation functions are implemented in neural net-
work designs to imitate the neurological behaviour such as ReLU,
- An analytical resource aware model analysis (RAMA) to assess tanh, sigmoid, etc., which also introduce non-linearity to the net-
the resource cost, to help on-chip system architecture design. works.
- A datapath design with multipliers replaced by XNOR, popcount Batch Normalization (BN) layer: Since the distribution of each
and shifting operations for BNNs, and a compression tree gen- layer’s input can fluctuate during training, Batch Normalization
eration method for more efficient popcount. [23] is introduced to speed up training. For a d-dimensional input
- An optimized data managing pattern with parameter quantiza- vector x = (x(1 ) , x(2 ) , . . . , x(d ) ), we can normalize each dimension
tion and on-chip storage strategy. with:
- A demonstration with popular small (MNIST MLP and Cifar-
x(k ) − E[x(k ) ]
10 ConvNet) and large (AlexNet) models implemented on FPGA x (k ) =
  (5)
in binarized style, achieving a performance of TOP/s with high V ar[x(k ) ]
power efficiency.
After that, for each activation x(k) , we should scale and shift the
The rest of the paper is organized as follows. Section 2 reviews normalized value to achieve an identity transform:
the basic concepts of CNN and BNN and discuss on the related y ( k ) = γ ( k )
x (k ) + β (k ) (6)
works. Section 3 describes the RAMA method. Section 4 presents
the system design and the details of each processing element (PE). where andγ (k) β (k)
are to be learned during the training process.
Section 5 explains how we tile and schedule the large computing The whole process is described in Algorithm 1.
task onto our system. Section 6 covers a data quantization to com-
press the model, and introduces the on-chip design of the memory Algorithm 1 Batch Normalization [23].
system. Evaluation will be discussed in Section 7, and conclusion 1: Require: A mini-batch of input values: B = {xi }, i = 1 ∼ m; Ini-
will be given in Section 8. tialized parameters: γ , β .
2: Ensure: Updated γ , β ; Output yi = BNγ ,β (xi ), i = 1 ∼ m.

1 
m
2. Background
3: μB = m xi ; //Get mini-batch’s mean
i=1
In this section, we will first provide an overview of the basic 
m
4: σB2 = 1
m (xi − μB )2 ; //Get mini-batch’s variance
concepts of CNN, and then explain how a binarized NN works. i=1
Based on these concepts, we take a brief overview of related ef- xi −μB
5: xˆi = √ ; //Normalize
forts and discuss them. σB 2 + ε
6: yi ≡ BNγ ,β (xi ) = γ xˆi + β ; //Scale and shift

2.1. Basics of CNN

Fig. 1 shows a typical CNN model structure [22]. A CNN model 2.2. Training a CNN
usually consists of CONV layer, FC layer and Pooling (POOL) layer,
forming a trainable network. CONV layer: The CONV layer realizes a A given CNN model with initialized parameters should be
filter-like process, which uses a K × K weight kernel W to convolve trained on a certain dataset in order to approximate the ideal

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

S. Liang et al. / Neurocomputing 000 (2017) 1–15 3

Fig. 1. A typical CNN model structure.

Table 2
Comparison between activation function Tanh, sign and HTanh.

Operation Function plots Derivative plots


1 1
f (x) f (x)

0.8
0.5

x −x
0.6

e −e
T anh(x) =
x
−4 −2 2 4 0.4

ex +e−x −0.5 0.2

x
−1 −4 −2 2 4

⎧ 1
f (x) 1.2
f (x)

⎨ +1 x≥0 0.5
1

0.8

sign(x) =
x 0.6
−2 −1 1 2

⎩ −1
0.4

x<0
−0.5
0.2

x
−1


−4 −2 2 4


⎪ +1 x>1

f (x) 1.2
1 f (x)

⎨ 0.5
1

0.8

HT anh(x) = x −1 ≤ x ≤ 1
x 0.6
−2 −1 1 2


0.4


−0.5


0.2

⎩ −1
x
−1

x < −1
−2 −1 1 2

model for ground-truth results. The most commonly used training tion:
method is Back-Propagation (BP) training, which consists of two 
+1 x>1
stages: est (Sign(x )) = H Tanh(x ) = x −1 ≤ x ≤ 1 (8)
−1 x < −1
(1) Forward propagation (Inference), which leads the input data go-
ing through the network to get an output result; Assume the required gradient is ddCu , and a = Sign(u ), then we will
(2) Back propagation, which calculates the error between output have the estimator of the gradient:
and ground-truth labels with a defined loss function C, and then   dC
propagates the gradient of each layer’s output function back-
dC dC dSign(u ) −1 ≤ u ≤ 1
est = · = da
(9)
wards to update the weights in order to minimize the loss func- du da du 0 otherwise
tion for the next training iteration. Since BN layers have the effect of avoiding internal covariate
shift, which can accelerate the training process and reduce the im-
Detailed derivation can be found in [24]. Since the overall pro- pact of binarization, [18] introduces BN layers in their BNN mod-
cess is compute-intensive, high-performance servers with acceler- els. To deal with the large amount of multiplications in BN, they
ators such as GPUs are often used in training. Then the pretrained replace them with shift operations to get a Shift-Based BN (SBN).
models can be used in many real-time scenarios by going through This can largely reduce the computing resource cost with only a
inference process only with minor changes, which can be imple- small loss of precision – which actually can be healed through
mented on many embedded hardware platforms. the training process. The SBN replacement can be described as
Eq. (10) where sal(x, y) means an arithmetic left shift to x by y
bits:
2.3. How BNN works
x · y ≈ sal [x, round (log2 |y| )] · sign(y ) (10)
The essential idea of BNN is to constrain both weights and ac-
tivations to +1 and −1 [18]. The binarization method can be done 2.4. Related work
in either stochastic or deterministic way, and the latter is often re-
alized by the Sign function: To accelerate an NN model in embedded hardware, spade hus-
 bandry should be taken. There has been many efforts deploying
+1 x≥0 CNN models in hardware. Farabet et al. [26] designed a 3 CONV
xb = Sign(x ) = (7)
−1 x<0 layers +5 FC layers simple face detection system on FPGA with 10
frames (512 × 384) per second’s performance. Zhang et al. [14] pro-
The problem is that during the training process, the derivative of posed a nested-loop model to describe CNN, and accelerates CONV
the Sign function is almost zero everywhere (as shown in Table 2), layers only under the guidance of a roofline model. Qiu et al.
resulting in an incompatibility with the BP training process. Hinton [15] realized an even deeper VGG model on FPGA. Most of these
[25] introduced a “straight-through estimator” to cope with this previous designs store weights and fmaps off-chip since their size
problem. Courbariaux et al. [18] used a similar estimator in a de- is too large for on-chip storage. As a result, the dataflow bandwidth
terministic way, which can be seen as a hard tanh (HTanh) func- is limited and frequent off-chip memory access happens. So some

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

4 S. Liang et al. / Neurocomputing 000 (2017) 1–15

designs support dedicated memory cache for on-chip data reuse Table 3
Resource cost of MACs on Stratix V FPGA.
[27–29], but the increase of memory placement means fewer arith-
metic resources since chip area is limited. Operation LUT FF DSP
Clearly a small model that supports high accuracy and high per- 32-bit float add( + ) 581 525 0
formance is ideal. One method is to exploit the sparsity inside the x-bit fixed add( + ) x x+1 0
model by pruning off connections [16,30,31]. Another method is to 32-bit float mult( × ) 147 363 1
x
reduce bit-width of operations. Much previous work took a quan- x-bit fixed mult( × ) 0 1 18

tized fixed-point strategy to the on-chip data [15,27,28,32,33] pre-


sented a detailed analysis pointing out that for small models such
as MNIST and Cifar-10, the weights can be quantized to 4 bits,
while for large models such as AlexNet, 8 bits would be necessary.
Recently, some efforts successfully reduced the bit-width of
weights to 2 bits such as ternarized weight NN (TWN) [34,35], or
even 1-bit binarized weight NN (BWN) [20]. Moreover, activations
can be reduced to 2 bits [36–38] or even 1 bit (BNN) with little
loss for small datasets [18,20]. These results stimulate hardware
development. YodaNN [39] designed a UMC 65-nm ASIC targeting
BWN with 1.5 TOP/s. Alemdar et al. [37] implemented ternarized
NN (TNN) on FPGA with a speed of 600 GOP/s for MNIST MLP and
200 GOP/s for Cifar-10 ConvNet under 250 MHz clock, and on ST
28 nm ASIC with doubled throughput and around 300 mW power
consumption under 500 MHz clock. Zhao et al. [40] implemented
a BNN on FPGA with the help of high-level synthesis (HLS) tool,
and get a 200 GOP/s performance for Cifar-10 ConvNet. In addi- Fig. 2. Weight storage strategy selection for small (MNIST & Cifar-10) and large
tion, Umuroglu et al. [41] also proposed a BNN design targeting (AlexNet) models.
small datasets MNIST and Cifar-10 and reached a performance of
TOP/s.
We should notice that since the bit-width of data has been re- For operations in BN layers, the number of operations (NOP)
duced by 32 times in BNN, an execution speed of TOP/s is ex- has a linear relationship with the number of output channels Nout .
pected since many recent non-BNN designs have already reached a Notice that the shift-based transformation can change multiplica-
performance of several hundred GOP/s. The key optimizations in- tions into cheap sum and shift operations. To get NOP after tiling,
clude: (1) single-bit based MAC operation, which can be replaced we just need to replace the original dimensions with tiled ones,
by efficient XNOR and popcount operations and can be free from and then we can estimate the resource cost for a certain type
conventional multiply and add operations; (2) small size for both Cres_type (layer ) by summing up the product of tiled NOP and re-
parameters and intermediate results, which would enable on-chip source cost of one operation, which in tern help us determine the
caching; (3) broaden bandwidth for on-chip BRAMs, which would tiling factor.
reduce the bottleneck of data dependency with wide data-access Next, from the memory perspective, we should concentrate on
patterns such as those in FC layers. Our FP-BNN design is devel- the size of parameters and the activation outputs of each layer.
oped based on the above motivations. Furthermore, FP-BNN sup- Given that Nlayer (data ) denotes the size of a certain kind of data
ports large models such as XNOR-Net version AlexNet. in one layer, then for the weights we have

NCONV (W ) = Nin × Nout × K 2 (12)


3. Resource-Aware Model Analysis (RAMA)
For other parameters, such as biases, normalization parameters,
they are given by the number of output channels, that is
To design an NN accelerator on chip, we should consider how
to tile the overall task onto limited resources, which can be classi- NCONV (Other ) = Nout (13)
fied into two classes: arithmetic units and memory units. To help
For activations, we have
choosing the size of task tiles, we need to estimate the resource
cost beforehand. The RAMA method is introduced to address this NCONV (A ) = Nout × R2out (14)
need.
The overall memory cost of each type of data is the product
In modern FPGA platforms, four kinds of resources are pro-
of the bit-width and the amount of data. For weights, from Fig. 2
vided: look-up tables (LUTs), flip-flops (FFs), block RAMs (BRAMs)
we can see that in an ideal binarized condition, small models
and digital signal processing units (DSPs). LUTs and DSPs are the
can completely be stored in on-chip BRAMs, while large models’
key to form arithmetic and control logic, while BRAMs are usu-
amount of weight can exceed the upper limit of available BRAMs.
ally used as on-chip storage for fast data access. From the arith-
We use a tiled weight storage strategy that takes only one portion
metic perspective, MACs are the key operations which cost most
of weights required for the current tile from off-chip memories.
resources. DSPs have hard-wired multipliers and can be configured
For activations (feature maps (fmaps)), since data adjacency will
to quickly deliver results under high clock frequency – and one can
be needed in both vertical and horizontal axis, BRAM will not be
choose LUTs to implement a customized multiplier. We compare
a suitable choice since it can only be configured into fixed shapes,
the resource cost of these two ways on a Stratix V FPGA synthe-
and the maximum width of one BRAM is often no more than 40
sized with Altera Quartus v13.1, and the result is shown in Table 3.
and accordingly the minimum depth is 512.
With the resource cost of one single MAC operation in hand, we
need to further count the number of MACs in each layer, which can
be represented as Nlayer (MAC ). For CONV layers we have (FC layers 4. Hardware logic design
can be seen as K = Rout = Cout = 1):
In this section, we present the hardware logic design of our
NCONV/F C (MAC ) = Nin × Nout × K 2 × Rout × Cout (11) FPGA accelerator system.

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

S. Liang et al. / Neurocomputing 000 (2017) 1–15 5

Fig. 3. A normal structure of a BNN model.

Table 4
RAMA-based topology analysis of MNIST MLP, Cifar-10 CONV-Net and AlexNet.

Macro Layer Structurea Nin × R2in K S KPOOL b Nout × R2out N (W ) N (Others ) N (MAC ) N (A )

MNIST MLP 1 F-B-A 784 − − − 2048 1.61 M 10240 1.61 M 2048


2 F-B-A 2048 − − − 2048 4.19 M 10240 4.19 M 2048
3 F-B-A 2048 − − − 2048 4.19 M 10240 4.19 M 2048
4 F-B 2048 − − − 10 2048 50 20.48 K 10
Total 10.01 M 30.77 K 10.01 M −
CIFAR10 ConvNet 1 C-B-A 3 × 322 3 1 − 128 × 322 3456 640 3.54 M 131.07 K
2 C-P-B-A 128 × 322 3 1 2 128 × 162 147.46 K 640 150.99 M 32.77 K
3 C-B-A 128 × 162 3 1 − 256 × 162 294.91 K 1280 75.50 M 65.54 K
4 C-P-B-A 256 × 162 3 1 2 256 × 82 589.82 K 1280 150.99 M 16.38 K
5 C-B-A 256 × 82 3 1 − 512 × 82 1.18 M 2560 75.50 M 32.77 K
6 C-P-B-A 512 × 82 3 1 2 512 × 42 2.36 M 2560 150.99 M 8192
7 F-B-A 8192 − − − 1024 8.39 M 5120 8.39 M 1024
8 F-B-A 1024 − − − 1024 1.05 M 5120 1.05 M 1024
9 F-B 1024 − − − 10 10.24 K 50 10.24 K 10
Total 14.02 M 19.25 K 61.69 M −
AlexNet ConvNet 1 C-P-B-A 3 × 2242 11 4 3 96 × 272 34.85 K 480 105.42 M 69.98 K
2 C-P-B-A 96 × 272 5 1 3 256 × 132 614.40 K 1280 447.90 M 43.26 K
3 C-B-A 256 × 132 3 1 − 384 × 132 884.74 K 1920 149.52 M 64.90 K
4 C-B-A 384 × 132 3 1 − 384 × 132 1.33 M 1920 224.28 M 64.90 K
5 C-P-B-A 384 × 132 3 1 3 256 × 62 884.74 K 1280 149.52 M 9216
6 F-B-A 9216 − − − 4096 37.75 M 20480 37.75 M 4096
7 F-B-A 8192 − − − 4096 16.78 M 20480 16.78 M 4096
8 F-B 4096 − − − 10 0 0 4.10 M 50 0 0 4.10 M 10 0 0
Total 62.37 M 52.84 K 1.14 G −
a
F = FC, C = CONV, P = POOL, B = BN, A = Activation (HTanh+BNeu).
b
All pooling layers’ stride is 2.

4.1. Overall architecture Next, we take a look at the details of different types of PE de-
sign.
A normal structure of a BNN model is given in Fig. 3. We can
divide the model into several macro-layers with similar structures,
4.2. C/F PE
each including a convolution or fully-connected (C/F) layer, a batch
normalization (BN) layer and an activation layer which consists of
4.2.1. XNOR-based Binary MAC
a Hard Tanh (HTanh) layer and a Binarized Neuron (BNeu) layer.
Normally, it is necessary to utilize DSPs or customized LUT-
For some macro layers, pooling is introduced for down-sampling.
based logic to complete a MAC operation for both floating-point or
Here we choose MNIST MLP and Cifar-10 ConvNet as small dataset
fixed-point input values. However, if input values become binary,
examples, and AlexNet for large dataset ImageNet. The topology of
it will be much different.
each model is described in Table 4, and the key features of each
Consider two input vectors A = {ai } and B = {bi } (i = 1 to Nin )
layer are extracted based on RAMA, in which Rin and Rout are re-
which consist of binarized values either +1 or −1, then the prod-
spectively the input and output image size, K is the convolution
uct of the corresponding elements in two vectors will also be ei-
kernel (window) size, S is the stride of the moving window, and
ther +1 or −1. The sign of the product depends on the two input
KPOOL is the pooling window size.
elements’ signs - if they are identical, then the product will be pos-
The overall system is shown in Fig. 4. We have altogether NPE
itive, otherwise it will be negative. Then, we need to accumulate
channels to process in parallel the data from the input cache.
these binary values to get a final result. This process is depicted in
CONV/FC (C/F) layer includes processing elements (PEs) that are
Fig. 5(a).
shared by the CONV and FC since they both mainly consist of MAC
Hardware implementations usually take 2 bits to represent +1
computations. Shift-Based Normalization (SBN) layer adopts shift
and −1. If we use only one bit, we should take 0 and 1 as the
operations to replace multiplications as mentioned in Section 2.3.
basic values. This can be achieved through affine transformation.
Activation layer merges the HTanh and BNeu layers together to
Since we have
produce an output vector containing either 0 or 1. Parameters for
each layer are fetched from on-chip BRAMs or registers to meet A −1,1 +A 1
A 0,1 = (15)
bandwidth requirements, and control signals select them for each 2
iteration. The output for each iteration will be transferred to the
in which A 1 represents the all-1 vector of the same length of
intermediate result cache. For each next layer, the interconnection
A −1,1 . To keep the truth table for the result as shown in Table 5,
will be reconfigured by the controller according to the type (CONV
we can infer that the operation should be transformed from multi-
or FC) of the layer.
plication to XNOR. In addition, if we assume r to be the dot product

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

6 S. Liang et al. / Neurocomputing 000 (2017) 1–15

Fig. 4. The overall system architecture design.

Table 5
Truth table of affine transformed inputs and result.

Original multiplication Affine transformed

a −1,1 b −1,1 a·b −1,1 a 0, 1 b 0, 1 a·b 0, 1

1 1 1 1 1 1
1 −1 −1 1 0 0
−1 1 −1 0 1 0
−1 −1 1 0 0 1

of vector A −1,1 and B −1,1 of length vec_len, then we will have


vec _len
result = A −1,1 ·B −1,1 = ai −1,1 · bi −1,1
i=1


vec _len
= (ai −1,1 · bi −1,1 ) − (−ai −1,1 · bi −1,1 )
i=1


vec _len
= X NOR(ai 0,1 , bi 0,1 ) − XOR(ai 0,1 , bi 0,1 )
i=1
= 2 popcount (R 0,1 ) − vec_len (16)
in which
R 0,1 = {X NOR(ai 0,1 , bi 0,1 ), i = 1 to vec_len} (17)
If one of the inputs is already 0, 1 based, for example, the first
layer, then we get the result with:
+A 1 A −1,1
Fig. 5. Conversion from (a) −1, 1 -based MAC to (b) 0, 1 -based XNOR and pop-
result = A 0,1 ·B −1,1 =
· B −1,1
2
count operations. 
2 popcount (R 0,1 ) − vec_len bi −1,1
= +
2 2

= popcount (R 0,1 ) − vec_len + b i 0,1 (18)

This means we need to add the popcount of vector B 0,1 in-


stead of left-shifting 1-bit, as shown in Fig. 5(b). The layer control

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

S. Liang et al. / Neurocomputing 000 (2017) 1–15 7

Fig. 6. The popcount compressor tree based on 6:3 compressors and one ternary adder.

signal selects the operation to the output of the popcount com- Algorithm 2 Popcount compressor tree generation algorithm.
pressor tree.
1: Require: Input vector: i of height N
2: Ensure: Updated: Column vector height h(i, j ), i stands for the
4.2.2. Popcount Compressor (PC) tree weight of 2i and j for the compression stage; Heap of stage j:
The popcount value, also known as Hamming Weight, can easily H ( j ) = {h(k, j )}, k = 0, 1, . . . , log2 (N ).
be calculated in parallel hardware. However, for long vectors, this 3: h(0, 0 ) = N, i = 0, j = 0;
process can be demanding both in time and in resource usage. The 4: while max(H ( j )) > 3 do
most common way is to use a binary full adder tree to sum up the 5: H ( j + 1 ) = zeros(1, log2 (N ));
bits in vectors, which will result in a delay of l og2 (vec_l en ) and 6: for k = 1 to log2 (N ) do
n − 1 adders of different bitwidth. Here we present a compressor 7: if h(k, j ) > 3 then
tree method inspired by [42]. 8: ncompressor (k, j ) = h(k, j )/6;
The popcount process can be seen as compressing N input bits 9: h(k, j + 1 ) = h(k, j + 1 ) + ncompressor (k, j );
into log2 N + 1 result bits with weights. Since most modern FPGA 10: h(k + 1, j + 1 ) = h(k + 1, j + 1 ) + ncompressor (k, j );
architectures have 6-input LUTs, a 6:3 compressor (can be seen as 11: h(k + 2, j + 1 ) = h(k + 2, j + 1 ) + ncompressor (k, j );
N = 6) is therefore an efficient basic component, for it can calcu- 12: else
late the popcount of a 6-bit input vector in a look-up table, which 13: h(k, j + 1 ) = h(k, j + 1 ) + h(k, j );
leads to a 3-bit popcount output with only three 6-input LUTs in 14: end if
parallel. 15: end for
Given that tuple T = ( pk ; qk+m−1 , . . . , qk+1 , qk ) represents a pk : 16: j = j + 1;
m compressor, where the subscript j = k, k + 1, k + 2 . . . stand for 17: end while
the bit weight 2j and pj , qj stand for the input and output bit num-
ber of certain weight, respectively. In this way, a 6:3 compressor
Table 6
can be represented as (6; 1, 1, 1). Comparison between accumulation adder tree and popcount com-
As shown in Fig. 6, the input vector is divided into 6-bit por- pressor tree.
tions, each connected to the input ports of a 6–3 compressor.
BWin (bits) BWout (bits) LUTs
Empty input bits will be filled with dummy 0’s (shown as hollow
dots in Fig. 6). For the following stages, the bits with the same Acc. Pop. Saved (%)

weight (we call it column vector) will repeat the same process, 2
9 (3 ) 4 9 10 −11.1
forming a compressor tree. The output bits will heap due to their 16 5 21 19 9.52
64 7 98 79 19.39
weights. Our target is to reduce the height of the heap to 3, which
256 9 398 291 26.88
can then be accepted as three input vectors of a ternary adder to 1024 11 1596 1106 30.70
get the final sum. Thus, the column vectors with height of less 1152 (128 × 32 ) 11 1796 1228 31.63
than 4 will stop being compressed for the next stage, and the com- 1200 (48 × 52 ) 11 1864 1282 31.22
pression process will terminate when all column vectors’ heights 8192 14 12768 8362 34.51

are less than 4. The overall process of generating a compressor tree


is described in Algorithm 2.
With the help of Algorithm 2, we obtain the compressor tree ers’ weights are not binarized. To deal with this, if a vector x of n
topology for the hardware implementations of popcount functions fixed-point inputs with m-bit precision:
for different sizes of binary vectors. Table 6 gives the comparison
between accumulation adder tree and compressor tree. As we can x = ( xm −1 m−2
x
n−1 n−1
...x0n−1 , xm −1 m−2
x
n−2 n−2
...x0n−2 , . . . , xm
0
x0 ...x00 )
−1 m−2
(19)
see, for long vectors, compressor tree saves around one third of
LUT resources. and a vector w of n p-bit weights:

4.2.3. PE reuse w = (wnp−1 w p−2 ...w0n−1 , wnp−1


−1 n−1
w p−2 ...w0n−2 , . . . , w0p−1 w0p−2 ...w00 )
−2 n−2
With the XNOR array connected to a popcount compressor tree, (20)
we can get the result for 0/1 input arrays. For all intermediate lay-
ers, the inputs (activations from the preceding layer) and weights then the output vector s could be calculated by
must be binarized (either 0 or 1). However, this is not the case for

p

m 
n
the first layer – we usually take a fixed point input image from s=x·w= 2i−1 2 j−1 (xkj−1
−1
· wik−1
−1
) (21)
the input cache. Also for some large models like AlexNet some lay- i=1 j=1 k=1

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

8 S. Liang et al. / Neurocomputing 000 (2017) 1–15

Fig. 7. The C/F layer PE module.

Fig. 8. The SBN layer PE module (MNIST and Cifar-10).

Implementing Eq. (21) requires reuse of PE. Hence, we intro- For all models, we train them with the last BN layer kept as
duce an accumulator to the PE with a selectable left-shifter. While the original batch normalization in order to avoid accuracy loss,
the precedent lower bit vector is being processed, the next input that is, with no shift-based operations. Floating-point multiplica-
vector can be loaded behind, and be added with the shifted result tions are implemented independently with DSP blocks to accom-
of the precedent vector. The start and the end of the accumulation plish the multiplication operations, and for ImageNet classification
will be set by the controller signal. A detailed scheduling will be (AlexNet), the output process will be tiled.
introduced in Section 5.
The overall structure of C/F PE is given in Fig. 7. For AlexNet, the 4.4. Activation PE
binarization method is different from sign function[20]. It intro-
duces a binarized filter wbin for w with a scaling factor α in order For the last part of a normal macro layer, we need to binarize
to approximate the MAC operation by x · w ≈ α (x · wbin ), where the result into either 0 or 1. In the training process, the HTanh
wT wbin
α= n = 1n w1 . So for AlexNet the accumulator is followed function, as shown in Table 2, constricts the values between −1
by a multiplier to time the scaling factor α . and 1, while the final BNeu layer will push those values in between
to the two boundaries, which means −1 for all the negatives and
4.3. BN PE 1 for the others. Since we introduce in Section 4.1 that the (−1,1)
based vectors can be affine transformed into (0,1) vectors, the case
As described in Algorithm 1[23], the batch normalization pro- becomes much simpler: we just need to discern all the negative
cess can be presented as: values from the SBN layer and set them to 1, with the others to
x−μ 0. This can be done directly by accessing the signal bit of these
y= · γ + β, (22) values.
σ
where μ stands for the running mean value, σ stands for the stan-
dard deviation. γ and β are the learnt values to implement affine 4.5. Pooling
scale and shift for an identity transform. However, as mentioned
in Section 2.3, floating-point multiplications are required for every For some macro-layers, pooling is applied to support sub-
normalization process, which will lead to a considerable resource sampling in order to reduce the output fmap size. As shown in
cost. For this reason, [18] uses shifting to approximate the multiply Table 4, pooling comes closely after CONV layer, and the pooling
operation. For Eq. (22), the shift-based approximation would be: type for all of our models is max-pooling. If pooling comes after
  activation, most of the output values will be +1 which result in
γ 
y = sal [(x − μ ), φ ] · sign + β,
σ
(23) significant information loss for training [20]. So a C-P-B-A macro-
γ  layer structure is taken. However, for the inference process, a C-B-
where φ = round (log2  σ  ) is the left-shift value of both σ and γ . A-P structure can get an identical result and the pooling is applied
As we have the pre-trained models in hand, we can calculate to values of 0 and 1 only. This can be directly implemented with
γ
the required parameters for BN like σ in advance, and store them OR operations. We organize selective line buffers after the activa-
into the corresponding parameter cache. With the above, we get tion PEs, and when a pooling process is required, the activation
the SBN PE as presented in Fig. 8. We have also noticed that the values will stream into the line buffers. If the pooling size is K, we
shifting-based approximation would cause severe accuracy drop for will enable OR operations to the horizontal targeted locations once
AlexNet (from 42.9% to 31.9%), so we avoid using shift replace- K rows of activations arrive, and then a similar process is repeated
ment for AlexNet and keep the original batch normalization, using along the vertical direction with other line buffers to complete a
a multiplier to replace the shift and sign operations. 2D max-pooling.

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

S. Liang et al. / Neurocomputing 000 (2017) 1–15 9

Table 7
Tiling strategy for different models.

Model NPE PEsize Lin of layer

1 2 3 4 5 6 7 8 9

MNIST 1024 784 1024


Cifar-10 64 1152 405 1152 1024
AlexNet 1200 1089 1200 1152 1024

will be implemented through XNOR + Popcount, and the m-bit of


input is actually calculated in different run and accumulated with
shifting. As the P Esize is much larger than Lin = Nin × K 2 in the first
layer, we can repeat the innermost MAC by 2j times inside one PE
to complete the j iterations in one run. This process is illustrated
in Fig. 10. If t-bit is tiled in one run, then we have


t
Lin = Nin · K 2 · 2i−1 (24)
i=1

and through this tiling, the utilization rate of PE for the first layer
is increased.
With RAMA and the information given in Tables 4 and6, our
tiling strategy is shown in Table 7.

6. Memory system design

In this section, we introduce the memory system design for FP-


BNN. This mainly consists of two parts: the first is parameter quan-
tization and storage, and the second is on-chip fmap caching.

6.1. Quantization over other parameters


Fig. 9. Task scheduling for C/F layer: (a)FC; (b)CONV.

Since in BNN models, weights have already been binarized, so


to take a further step, we quantize non-weight parameters which
5. Task tiling and scheduling (T&S) we call as other parameters. It would be essential for small models
since we would like to store everything on-chip. These parameters
This section introduces T&S, a method for tiling and scheduling are in floating-point format, and their number mostly is equal to
models on chip. The T&S method is depicted in Fig. 9. We assume the number of output channels of a layer. BRAMs would be too
the width of one PE to be PEsize , and the number of tiled input wasteful for their storage, since these parameters need to be pro-
channels to be TNin . The number of tiled output channels equals to vided in parallel and the width equals to NPE , which would make
the number of PE channels NPE . the depth of their storage too shallow. So we place them in fast
For FC layers, one tiled input will be fed into all PE channels. distributed memories. Such memories are available in Altera FP-
For better resource utilization we set TNin as close as possible to GAs as memory logic array blocks (MLABs). Since we also need to
 
P Esize . It will take
Nin
iterations to get the intermediate result use MLABs to construct intermediate cache, quantization of other
TN
in   parameters is adopted to make the best use of limited MLAB stor-
Nout age.
accumulated for one output. This will be repeated by NPE times
A fixed-point number can be represented as:
for all outputs get done.
For CONV layers, since most filter kernel size K is rather small, 
BW −1
we consider joining several filters together as one input for C/F PE. n= Ni · 2i− fl , (25)
For one PE, the input will be summed up to get one output, so the i=0
joined filters should be at the same location of input fmaps as dif- where BW stands for the overall bitwidth (including the sign bit)
ferent locations have no dependency to each other. Theinput vec- of the number and fl stands for the fractional part bitwidth. Here
Nin
tor size will be Lin = TNin × K 2 . Similar to the FC layers, TN will we use Q = (BW, − fl ) to capture the quantization strategy for a
in
be taken to get one output pixel, and the next tiled input location particular type of parameters in a layer. To transform a floating-
will be given in a sliding window style. point number nfloat to a fixed-point number nfixed using a given Q
Considering the datapath shared among different layers, NPE strategy, we make use of the round-to-nearest rounding mode[43]
should be a common divisor of Nout of different layers, TNin for each to shift and cut:
layer should preferably be best a sub-multiple of Nin , and PEsize
nfixed = sal {round[sal (nfloat , fl )], − fl } (26)
should be a big value and also close to Lin to best explore the re-
source utilization. We use this method to quantize parameters for various models
The first layer can become a bottleneck for the datapath since to see how the accuracy fluctuates with the bitwidth variation
the number of Nin is small (usually 3). Let us study Eq. (21), for in- (Fig. 11). It is obvious that when the bitwidth of parameters drops
put values which are not binarized, TNin can get multiplied if tiling below a particular threshold, the model accuracy drops signifi-
could be achieved inside Eq. (21). Notice that the inner most MAC cantly.

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

10 S. Liang et al. / Neurocomputing 000 (2017) 1–15

Fig. 10. Example of tiling for multiple bit case (2-bit input and 1-bit weight).

Fig. 11. The model accuracy variation as a function of bitwidth of BN’s mean μ ((a)MNIST; (c)Cifar-10) and affine bias β ((b)MNIST; (d)Cifar-10), respectively.

For the biases of C/F layers, we discover that even if their ConvNet, the weights can fit into an array of BRAMs. For AlexNet,
bitwidth drops to 0, there is still no significant variance of the re- the weights will be tiled by each layer or inside a layer once they
sults. So, to save storage and computing resources, we ignore the get too large. The oldest weights for finished tiles will be covered
C/F layer bias addition. For the running means μ and the affine by weights for the next tile from off-chip memory in a ping-pong
biases β , the point varies between 2 to 8-bit for different layers. fashion. For other parameters, a similar storage structure is pro-
For the shift parameter of BN layers, we can express each φ as posed based on MLABs.
φmin + φ , and therefore we only need to store the variance us-
ing φ to reduce resource usage. So the bitwidth w of BW can be
calculated as:

log2 (φmax − φmin ) + 1, φmax > φmin
w= (27) 6.3. Intermediate cache
0 φmax = φmin
We choose a dynamic fixed-point strategy [44] to optimize the For the intermediate outputs, that is, the output fmaps of each
bitwidth for each layer. Table 8 shows all the value ranges and BW macro layer, we place a cache to hold them for the next macro
strategies that we have chosen for each parameter, and Table 9 layer to read. The design of intermediate cache is given in Fig. 13.
shows the comparisons between the quantized models and the For CONV layers, the cache structure facilitates the sliding window
original models. data fetching. For each input iteration, we separate the intermedi-
ate cache into TNin groups, each containing K memory blocks, and
6.2. Memories for parameters every two adjacent blocks storing consecutive rows. The input logic
needs to offer corresponding addresses for the required K rows,
The memory storage structure for weights is given in Fig. 12. In and select rows to choose the horizontal required K-bits. So in to-
order to reduce the memory access time, we keep the parallelism tal we get TNin × K 2 windows to form an Lin long tensor. The output
of memories identical to the number of PE channels NPE , and the fmaps of each layer will be stored into the spared space from the
width of each memory equals to P Esize . Weights for each comput- input fmaps in a ping-pong way. For FC layers, we just need to en-
ing tile, which is represented in form, will be arranged serially. sure that the cache bitwidth can satisfy TNin . For MNIST MLP we
An address generator will be controlled by the overall controller choose 32 32 × 32 MLABs for intermediate cache, and for Cifar-10
in order to provide the exact weight. For MNIST MLP and Cifar-10 ConvNet and AlexNet, we take 384 32 × 32 MLABs.

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

S. Liang et al. / Neurocomputing 000 (2017) 1–15 11

Table 8
Quantization strategies for parameters other than C/F weights (MNIST & Cifar-10),

Model Layer FC/CONV Batch normalization


 
Bias Mean (μ) Stdv & affine weight (φ = log2  σγ ) Affine bias (β )

Min Max Q Min Max Q Min Max Q Min Max Q

MNIST 1 −24.3146 22.2333 (0,0) −139.1505 148.0735 (6,3) 1 10 (4,0) −3.0121 3.0051 (4,−1)
2 −29.2907 34.2132 (0,0) −156.4832 166.0365 (6,3) 0 10 (4,0) −3.1457 3.1249 (4,−1)
3 −26.6156 35.4796 (0,0) −135.1257 139.6590 (7,2) 2 11 (4,0) −2.6029 2.6408 (2,1)
4 −0.1449 0.0467 (0,0) −44.5449 39.5620 (2,5)
Cifar-10 1 −0.9777 0.9976 (0,0) −0.9788 0.9983 (4,−3) 1 2 (1,0) −1 0.9998 (8,−6)
2 −0.9990 0.9998 (0,0) −42.0525 112.1753 (6,2) 6 7 (1,0) −0.7739 0.5044 (5,−4)
3 −0.9987 0.9998 (0,0) −97.7165 76.5325 (5,3) 6 6 (0,0) −0.9993 0.9991 (7,−6)
4 −1 0.9923 (0,0) −78.6930 190.5350 (6,3) 5 7 (2,0) −0.9661 0.6624 (6,−5)
5 −0.9968 0.9964 (0,0) −155.4622 172.6664 (7,2) 6 7 (1,0) −1 1 (5,−3)
6 −0.9862 1 (0,0) −29.3337 270.3043 (6,4) 5 8 (2,0) −0.9983 0.9684 (5,−4)
7 −1 1 (0,0) −798.1793 761.8890 (7,4) 4 8 (3,0) −1 1 (5,−3)
8 −1 1 (0,0) −131.6111 144.3177 (4,5) −6 7 (4,0) −1 1 (4,−2)
9 −0.2266 0.1051 (0,0) 108.3246 203.2282 (4,5)

Table 9
Model classification accuracy and size comparisons among original, binarized and quantized ones.

Model Accuracy Parameter Size

C/F Weights (M bit) Others (K bit)

MNIST Original (98.7 ± 0.2)% 305.63 961.56


BNN 98.32% 9.55 961.56
Ours 98.24% 9.55 82.02
Cifar-10 Original 89.06% 427.92 601.56
BNN 86.80% 13.37 601.56
Ours 86.31% 13.37 49.66
AlexNet Original 56.6%(top-1), 79.4%(top-5) 1903.31 1651.25
XNOR-Net & Ours 42.90%(top-1), 66.80%(top-5) 87.05 1651.25

Fig. 12. Memory storage management pattern for weight cache.

7. System evaluation 7.1. System environment

We evaluate the performance of our accelerator system in this We train the models on an IBM x3650 M4 server equipped
section. Environment setup and NN model preparation will be in- with an NVIDIA Tesla K40 (28 nm feature size, 2880 CUDA cores
troduced first. We target CNN models to train for a binarized ver- with 12 GB GDDR5 external memory) and a K80 GPU (28 nm fea-
sion, and apply quantization to the parameters to further compress ture size, 4992 CUDA cores with 24 GB GDDR5 external memory)
the model. Then we map the optimized BNN models onto FPGA, card, and use both cards to accelerate the training process. The
and provide performance analysis comparing FP-BNN with general evaluation system is built on Maxeler’s MPC-X20 0 0 platform [45].
purpose processors and other FPGA designs. The system has 8 dataflow engines (DFEs), each comprising a sin-
gle Altera Stratix-V 5SGSD8 FPGA (28 nm feature size) connected

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

12 S. Liang et al. / Neurocomputing 000 (2017) 1–15

Fig. 13. The structure of intermediate cache.

to 48 GB of DDR3 RAM, and can communicate with other DFEs Table 10


FPGA Resource utilization of different models.
through MaxRing interconnections. Besides, two Intel Xeon E5-
2640 6-core CPUs (32 nm feature size) are included in the server, ALM DSP BRAM
and can communicate with the DFEs through InfiniBand. Here we MNIST 182301(69.5%) 20(1.02%) 2210(86.09%)
take only one of the DFEs to implement the models. The Maxeler Cifar-10 219010(83.5%) 20(1.02%) 2210(86.09%)
system offers a convenient solution to support data communica- AlexNet 230918(88.0%) 384(19.6%) 2210(86.09%)
tion between software algorithms and FPGA hardware. Available 262400 1963 2567

7.2. Model preparation 7.4. Performance Analysis

We use Torch 7 framework [46] to train the NN models for We implement binarized models for the Xeon E5-2640 CPU, the
MNIST and Cifar-10 based on Hubara’s BNN framework [18] and NVIDIA Tesla K40 GPU and the Altera Stratix-V FPGA. Performance
for AlexNet based on Rastegari’s XNOR-Net framework [20]. The is measured as shown in Table 11. To feed CPU and GPU with
MNIST dataset is a permutation-invariant version consists of 60 K enough data, we take batch size to be identical to training for for-
examples of 28 × 28 gray level digit images for training and 10 K ward propagation. As we can see, with about an order of magni-
examples for testing. The Cifar-10 dataset consists of 50 K exam- tude slower clock frequency and much lower power consumption,
ples of 32 × 32 RGB colour images in 10 classes for training and our accelerator still gets an average speed-up of 314.07 times over
10 K examples for testing, and global contrast normalization and CPU and 19.08 times over GPU for MNIST, 51.83 times over CPU
ZCA whitening are used in the same way as Goodfellow et al. and 5.07 times over GPU for Cifar-10, and 11.67 times over CPU
[47] and Lin et al.[48] did. The ImageNet dataset consists of 1.2 M and 2.72 times over GPU for ImageNet. Peak speed-ups can reach
images from 1 K categories and 50 K images for testing, and a cen- 705.19 times over CPU and 70.75 times over GPU. Although the
ter crop of 224 × 224 is extracted for forward propagation. Adam model has been compressed for about 32 times, the low-precision
[49] learning rule is adopted for training, with a mini-batch size of operations can exploit the potential of fine-grained parallelism in
10 0, 20 0 and 800. The binarization method used here is determin- FPGA, which can offer higher performance than CPUs and GPUs. If
istic [50] considering the convenience for implementing hardware we take energy efficiency as the criterion, with similar feature size,
for inference. Model accuracies are measured and presented in the FPGA implementation can offer an efficiency of two to three
Table 9. As we can see, even the quantized versions for MNIST and orders of magnitude of CPU’s and GPU’s.
Cifar-10 keep high accuracies close to the state-of-the-art results. We take another comparison with some previous FPGA acceler-
XNOR-Net based AlexNet for our design suffers from a 13% accu- ator designs for CNN and BNN models, as listed in Table 12. We can
racy drop, while supporting state-of-the-art performance among see that our FP-BNN reaches a TOP/s speed which is significantly
the existing BNN solutions for ImageNet. faster than the previous CNN designs. For some designs (such as
that in [15]), one major problem is that for memory-centric FC lay-
ers, the data and parameter loading time is much longer than the
computing time as the number of input ports for data and weight
7.3. Hardware implementation RAMs is limited to 8, while in our design all the computing chan-
nels can be fed with data and weights in parallel. FP-BNN is also
We use MaxCompiler to generate the executable bit-stream for much faster than the most recent BNN design [40]. Although our
FPGA, which takes Altera Quartus II v13.1 to synthesize, place and design involves a large FPGA, the power efficiency is also 10 times
route the designs. The resource utilization of the final implementa- better. Another BNN design FINN [41] reaches a performance sim-
tion is shown in Table 10. The design can be driven with an achiev- ilar to ours. For the MNIST case, FINN has taken a smaller MLP
able 150 MHz clock. We notice that the utilization of DSP blocks is network in which the input dimension is larger than the number
not high, since only a small portion of arithmetic operations needs of neurons in each layer, which results in a higher resource utiliza-
floating-point multipliers. tion after task tiling. If we are prepared to reduce model accuracy

Please cite this article as: S. Liang et al., FP-BNN: Binarized neural network on FPGA, Neurocomputing (2017),
https://doi.org/10.1016/j.neucom.2017.09.046
JID: NEUCOM
ARTICLE IN PRESS [m5G;October 28, 2017;13:0]

S. Liang et al. / Neurocomputing 000 (2017) 1–15 13

Table 11
Performance analysis among Xeon E5-2640 CPU, NVIDIA Tesla K40 GPU and Maxeler MAX4 (Stratix V) FPGA systems.

CPU NVIDIA Tesla K40 GPU Maxeler MPC-X20 0 0 with Stratix V FPGA

Core clock (MHz) 2.5 K (Base) / 3 K (Boost) 745 (Base) / 810 & 875 (Boost) 150

DDR memory – 12 GB GDDR5 @3.0GHz 48 GB DDR3 @1.6 GHz

Power 95 W 235 W(Board) 26.2 W(Board)

Model Macro layer Ops Time (ms) Perf (GOP/s) Time (ms) Perf (GOP/s) Time (ms) Perf (GOP/s) Speedup to CPU Speedup to GPU

MNIST 1(FC) 3.21 M 17.54 18.32 1.08 298.63 1.97 × 10−3 1633.89 89.19x 5.47x
2(FC) 8.39 M 44.28 18.95 2.57 326.61 6.87 × 10−4 12219.40 644.91x 37.41x
3(FC) 8.39 M 43.98 19.08 2.57 326.61 6.87 × 10−4 12219.40 640.51x 37.41x
4(FC) 40.96 K 0.77 5.33 0.26 15.76 5.33 × 10−5 768.19 144.17x 48.75x
Total 20.04 M 106.57 18.80 6.47 309.48 3.39 × 10−3 5904.40 314.07x 19.08x
Cifar-10 1(CONV) 7.08 M 19.70 71.85 4.38 323.20 4.10 × 10−2 172.61 2.40x 0.53x
2(CONV) 301.99 M 355.74 169.78 31.0 1947.69 2.74 × 10−2 11040.33 65.03x 5.67x
3(CONV) 151.00 M 140.85 214.40 14.7 2059.96 1.37 × 10−2 11021.55 51.41x 5.35x
4(CONV) 301.99 M 283.23 213.25 27.8 2170.09 2.05 × 10−2 14712.09 68.99x 6.78x
5(CONV) 151.00 M 146.28 206.44 16.2 1858.86 1.03 × 10−2 14678.75 71.10x 7.90x
6(CONV) 301.99 M 297.53 203.00 30.7 1965.06 1.71 × 10−2 17646.50 86.93x 8.98x
7(FC) 16.78 M 104.95 31.97 7.00 479.38 1.01 × 10−3 16667.13 521.27x 34.77x
8(FC) 2.10 M 12.35 33.98 0.92 455.14 2.60 × 10−4 8069.91 237.48x 17.73x
9(FC) 20.48 K 0.64 6.45 0.33 12.27 6.67 × 10−5 307.35 47.62x 25.05x
Total 1.23 G 1361.28 181.29 133 1853.87 1.3 × 10−1 9396.41 51.83x 5.07x
AlexNet 1(CONV)a 211.83 M 790.69 213.31 345.68 487.91 9.98 × 10−1 211.19 0.99x 0.43x
2(CONV) 895.80 M 2125.08 337.23 186.57 3841.23 5.84 × 10−2 15347.72 45.51x 4.00x
3(CONV) 299.04 M 1715.52 139.45 127.28 1879.54 2.03 × 10−2 14711.75 105.50x 7.83x
4(CONV) 448.56 M 1460.71 245.67 165.95 2162.39 2.71 × 10−2 16560.22 67.41x 7.66x
5(CONV) 299.04 M 1346.86 177.62 108.86 2197.61 1.81 × 10−2 16545.97 93.15x 7.53x
6(FC) 75.50 M 1640.79 36.81 168.97 357.44 4.31 × 10−3 17503.28 475.50x 48.97x
7(FC) 33.55 M 1229.86 21.83 123.40 217.54 2.18 × 10−3 15391.94 705.19x 70.75x
8(FC)a 8.19 M 499.78 13.66 30.00 218.40 2.74 × 10−2 298.47 21.85x 1.37x
Total 2.27 G 10789.29 168.35 1256.72 722.68 1.16 1963.96 11.67x 2.72x
a
The weights of these layers are quantized to 8-bit.
Table 12
Performance comparison with former FPGA-based CNN accelerator designs.

| | FPGA'16 [51] | FPGA'16 [15] | FPL'16 [32] | FPGA'17 [40] | FPGA'17 [41] | This work |
|---|---|---|---|---|---|---|
| Platform | Stratix-V 5SGSD8 | Zynq XC7Z045 | Virtex-7 VX690T | Zynq XC7Z020 | Zynq XC7Z045 | Stratix-V 5SGSD8 |
| Clock (MHz) | 120 | 150 | 156 | 143 | 200 | 150 |
| Precision (bit) | 8–16 | 16 | 16 | Input: 8, weight: 1 | Input: 8, weight: 1 | Input: 8, weight: 1 (8 for the first and last layer of AlexNet), others: 2–8 (MNIST & Cifar-10), 32 (AlexNet) |
| Model size (OPs) | 30.9 G | 30.76 G | 1.45 G | 1.24 G | MNIST: 5.8 M; Cifar-10: 112.5 M | MNIST: 20.02 M; Cifar-10: 1.23 G; AlexNet: 2.27 G |
| Performanceᵃ (GOP/s) | 117.8 | 136.97 (O), 187.80 (C), 1.20 (F) | 565.94 | 207.8 (O), 318.9 (C) | MNIST: 9085.67; Cifar-10: 2465.5 | MNIST: 5905.40 (O), 12219.40 (P); Cifar-10: 9396.41 (O), 17646.50 (P); AlexNet: 1963.96 (O), 17503.28 (P) |
| Power (W) | 25.8 | 9.63 | 30.2 | 4.7 | 22.6 (MNIST), 11.7 (Cifar-10) | 26.2 |
| Efficiency (GOP/s/W) | 4.57 | 14.22 | 22.15 | 44.2 | MNIST: 402.02; Cifar-10: 210.72 | MNIST: 225.36 (O), 466.39 (P); Cifar-10: 358.64 (O), 673.53 (P); AlexNet: 74.96 (O), 668.06 (P) |

ᵃ O = Overall, P = Peak, C = CONV, F = FC.
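The efficiency row in Table 12 is simply throughput divided by board power, and the comparison with FINN discussed in the following paragraph follows from the same figures; a minimal check using the listed values:

```python
# Table 12 energy efficiency is throughput divided by board power (26.2 W here);
# the values below use the Cifar-10 and AlexNet figures listed in the table.
board_power_w = 26.2
cifar10_overall, cifar10_peak, alexnet_overall = 9396.41, 17646.50, 1963.96  # GOP/s

print(cifar10_overall / board_power_w)   # ~358.64 GOP/s/W (overall)
print(cifar10_peak / board_power_w)      # ~673.53 GOP/s/W (peak)
print(alexnet_overall / board_power_w)   # ~74.96 GOP/s/W (overall)

# Cifar-10 throughput relative to FINN [41], as discussed in the text below:
print(cifar10_overall / 2465.5)          # ~3.81x, i.e. "almost 4 times"
```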

If we are prepared to reduce model accuracy for a smaller network, the overall performance should get closer to the peak value (12 TOP/s). For the Cifar-10 case, our CONV-Net model can achieve a throughput almost 4 times that of FINN [41]. Our FP-BNN design also supports large datasets, which demonstrates the compatibility of our design method with various CNN models.

7.5. Discussion

There is considerable scope for improvement in FP-BNN, especially for the first layers, since datapath utilization is low due to the limited number of input channels. Moreover, the utilization of DSP blocks is low, and more DSP blocks could be involved if they could effectively support low-bandwidth operations to enhance the overall throughput. Furthermore, we can exploit the heterogeneity of logic elements in FPGAs, for example by introducing different bit-width choices together with binarized data for better use of DSP multipliers.
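To relate these observations to the headline numbers, the overall throughputs in Table 11 can be set against the 12 TOP/s peak quoted above; the small calculation below is an illustrative estimate only (the peak value and the totals come from the text and Table 11, while the percentage framing is an added illustration):

```python
# Rough datapath utilization: overall throughput from Table 11 relative to the
# 12 TOP/s peak quoted in the text above.
peak_gops = 12000.0
overall_gops = {"MNIST": 5904.40, "Cifar-10": 9396.41, "AlexNet": 1963.96}

for model, gops in overall_gops.items():
    print(f"{model}: {100 * gops / peak_gops:.0f}% of peak")  # ~49%, ~78%, ~16%
```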
This implementation shows that it is promising to deploy BNN models, especially on embedded systems, where they can offer competitive speed and accuracy with low power consumption. Recently, various designs [36,52] have shown that more complicated NN models can also be binarized with a tolerable loss of accuracy. Considering the similarity of their component layers and logic generation algorithms, it is feasible to implement these models layer by layer in a sequential way, as long as there is a sufficient amount of on-chip memory for the parameters.

8. Conclusion

This paper presents FP-BNN, our design for binarized neural networks targeting FPGA technology. Based on the RAMA analysis method, we design a 64-channel accelerator architecture which can accommodate both CONV and FC type layers. An XNOR-based method is introduced for binarized vector MAC operations, and the summing-up process is achieved with a popcount compressor tree which can be automatically generated. For small models like the MNIST MLP and the Cifar-10 ConvNet, shift-based normalization is introduced, which largely reduces the cost of multipliers. With proper dynamic quantization of the inputs and parameters, the model keeps good performance with the weights binarized and the other parameters compressed by over 10 times. Optimized on-chip data storage is managed with parameter quantization. Our implementation on the Maxeler MPC-X2000 platform (with a Stratix-V 5SGSD8 FPGA) shows a promising TOP/s-level speed with only 26.2 W of power at a 150 MHz clock frequency. We expect enhanced accuracy in future binarized models, which should greatly extend their range of applications.
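The XNOR-based MAC and the shift-based normalization summarized above can be captured by a short functional model. The sketch below is for illustration only: the accelerator itself uses a generated popcount compressor tree rather than a software popcount, and the 64-lane packing convention is an assumption based on the 64-channel architecture.

```python
# Functional model of a binarized vector MAC: weights and activations in
# {-1, +1} are packed one bit per lane (1 -> +1, 0 -> -1), multiplied by XNOR,
# and summed by popcount, so that  dot = 2 * popcount(XNOR(a, w)) - N.
N = 64  # lanes per MAC; assumed here to match the 64-channel datapath

def pack(vec_pm1):
    """Pack a list of +/-1 values into an integer, one bit per lane."""
    word = 0
    for i, v in enumerate(vec_pm1):
        word |= (1 if v > 0 else 0) << i
    return word

def xnor_popcount_mac(a_word, w_word, n=N):
    """Signed dot product of two packed +/-1 vectors."""
    xnor = ~(a_word ^ w_word) & ((1 << n) - 1)   # bit i = 1 iff a_i == w_i
    return 2 * bin(xnor).count("1") - n          # matches minus mismatches

def shift_normalize(acc, shift, bias):
    """Batch-norm style scaling approximated by a power-of-two shift."""
    return (acc >> shift) + bias if shift >= 0 else (acc << -shift) + bias

if __name__ == "__main__":
    import random
    a = [random.choice([-1, 1]) for _ in range(N)]
    w = [random.choice([-1, 1]) for _ in range(N)]
    assert xnor_popcount_mac(pack(a), pack(w)) == sum(x * y for x, y in zip(a, w))
```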
Acknowledgement

The support of Maxeler University Programme, Altera, Intel, UK EPSRC (EP/P010040/1, EP/L00058X/1, EP/L016796/1 and EP/N031768/1), the European Union Horizon 2020 Research and Innovation Programme under grant agreement number 671653, and the HiPEAC NoE is gratefully acknowledged.
References

[1] Y. LeCun, C. Cortes, C.J. Burges, The MNIST database of handwritten digits, 1998, http://yann.lecun.com/exdb/mnist/.
[2] A. Krizhevsky, V. Nair, G. Hinton, The CIFAR-10 dataset, 2014, https://www.cs.toronto.edu/~kriz/cifar.html.
[3] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[4] G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Process. Mag. 29 (6) (2012) 82–97.
[5] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al., Deep speech 2: end-to-end speech recognition in English and Mandarin, International Conference on Machine Learning (2016) 173–182.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602 (2013).
[7] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of go with deep neural networks and tree search, Nature 529 (7587) (2016) 484–489.
[8] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, N. Andrew, Deep learning with COTS HPC systems, in: Proceedings of the Thirtieth International Conference on Machine Learning, 2013, pp. 1337–1345.
[12] NVIDIA, Tesla K40 GPU Active Accelerator, NVIDIA, 2013.
[13] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, Y. LeCun, Neuflow: a runtime reconfigurable dataflow processor for vision, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, 2011, pp. 109–116.
[14] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, J. Cong, Optimizing FPGA-based accelerator design for deep convolutional neural networks, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2015, pp. 161–170.
[15] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., Going deeper with embedded FPGA platform for convolutional neural network, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2016, pp. 26–35.
[16] S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding, arXiv preprint arXiv:1510.00149 (2015).
[17] F.N. Iandola, M.W. Moskewicz, K. Ashraf, S. Han, W.J. Dally, K. Keutzer, Squeezenet: alexnet-level accuracy with 50x fewer parameters and less than 1MB model size, arXiv preprint arXiv:1602.07360 (2016).
[18] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or −1, arXiv preprint arXiv:1602.02830 (2016).
[19] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A.Y. Ng, Reading digits in natural images with unsupervised feature learning, in: Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[20] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, Xnor-net: imagenet classification using binary convolutional neural networks, European Conference on Computer Vision, Springer International Publishing (2016) 525–542.
[21] V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, Efficient processing of deep neural networks: a tutorial and survey, arXiv preprint arXiv:1703.09039 (2017).
[22] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
[23] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning (2015) 448–456.
[24] A. Ng, J. Ngiam, C. Foo, Y. Mai, C. Suen, Backpropagation algorithm of ufldl tutorial, http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm.
[25] G. Hinton, Neural Network for Machine Learning, Coursera, 2012.
[26] C. Farabet, C. Poulet, J.Y. Han, Y. LeCun, CNP: an FPGA-based processor for convolutional networks, in: Proceedings of the Nineteenth International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2009, pp. 32–37.
[27] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, in: Proceedings of the ACM Sigplan Notices, vol. 49, ACM, 2014, pp. 269–284.
[28] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., Dadiannao: a machine-learning supercomputer, in: Proceedings of the Forty-Seventh Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2014, pp. 609–622.
[29] N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-datacenter performance analysis of a tensor processing unit, arXiv preprint arXiv:1704.04760 (2017).
[30] W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, in: Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
[31] T.-J. Yang, Y.-H. Chen, V. Sze, Designing energy-efficient convolutional neural networks using energy-aware pruning, arXiv preprint arXiv:1611.05128 (2016).
[32] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, L. Wang, A high performance FPGA-based accelerator for large-scale convolutional neural networks, in: Proceedings of the Twenty-Sixth International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2016, pp. 1–9.
[33] P. Gysel, Ristretto: hardware-oriented approximation of convolutional neural networks, arXiv preprint arXiv:1605.06402 (2016).
[34] F. Li, B. Zhang, B. Liu, Ternary weight networks, arXiv preprint arXiv:1605.04711 (2016).
[35] C. Zhu, S. Han, H. Mao, W.J. Dally, Trained ternary quantization, arXiv preprint arXiv:1612.01064 (2016).
[36] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, Y. Zou, Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv preprint arXiv:1606.06160 (2016).
[37] H. Alemdar, N. Caldwell, V. Leroy, A. Prost-Boucle, F. Pétrot, Ternary neural networks for resource-efficient AI applications, arXiv preprint arXiv:1609.00222 (2016).
[38] W. Meng, Z. Gu, M. Zhang, Z. Wu, Two-bit networks for deep learning on resource-constrained embedded devices, arXiv preprint arXiv:1701.00485 (2017).
[39] R. Andri, L. Cavigelli, D. Rossi, L. Benini, YodaNN: an architecture for ultra-low power binary-weight CNN acceleration, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. PP (2017) 1–14.
[40] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, Z. Zhang, Accelerating binarized convolutional neural networks with software-programmable FPGAs, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2017, pp. 15–24.
[41] Y. Umuroglu, N.J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, K. Vissers, Finn: a framework for fast, scalable binarized neural network inference, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2017, pp. 65–74.
[42] M. Kumm, P. Zipf, Pipelined compressor tree optimization using integer linear programming, in: Proceedings of the Twenty-Fourth International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2014, pp. 1–8.
[43] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited numerical precision, CoRR abs/1502.02551 (2015).
[44] D. Williamson, Dynamically scaled fixed point arithmetic, in: Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, IEEE, 1991, pp. 315–318.
[45] Maxeler, MPC-X series, https://www.maxeler.com/products/mpc-xseries/.
[46] R. Collobert, K. Kavukcuoglu, C. Farabet, Torch7: a matlab-like environment for machine learning, in: Proceedings of the NIPS Workshop on BigLearn, EPFL-CONF-192376, 2011.

[47] I.J. Goodfellow, D. Warde-Farley, M. Mirza, A.C. Courville, Y. Bengio, Maxout networks, ICML (3) 28 (2013) 1319–1327.
[48] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint arXiv:1312.4400 (2013).
[49] D. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[50] M. Courbariaux, Y. Bengio, J.-P. David, Binaryconnect: training deep neural networks with binary weights during propagations, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
[51] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, Y. Cao, Throughput-optimized opencl-based FPGA accelerator for large-scale convolutional neural networks, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2016, pp. 16–25.
[52] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: training neural networks with low precision weights and activations, arXiv preprint arXiv:1609.07061 (2016).
gineering with Imperial College London, London, U.K. He
Shuang Liang received the B.S. degree from the Institute was a Visiting Professor with Stanford University, Stan-
of Microelectronics, Tsinghua University, Beijing, China, in ford, CA, USA. His current research interests include the-
2011. He is working toward the Ph.D. degree at the In- ory and practice of customizing hardware and software
stitute of Microelectronics, Tsinghua University, Beijing, for specific application domains, such as multimedia, net-
China. He was a visiting scholar at the Department of working, and finance.
Computing, Imperial College London, UK in 2016. His re-
search interests include reconfigurable computing, hard-
ware acceleration of machine learning algorithms and dis-
tributed systems.

Shaojun Wei was born in Beijing, China in 1958. He re-


ceived Ph.D. degree from Faulte Polytechnique de Mons,
Belgium, in 1991. He became a professor in Institute of
Microelectronics of Tsinghua University in 1995. He is a
Shouyi Yin received the B.S., M.S. and Ph.D. degrees in senior member of Chinese Institute of Electronics. His
Electronic Engineering from Tsinghua University, China, in main research interests include VLSI SoC design, EDA
20 0 0, 20 02 and 20 05, respectively. He has worked in Im- methodology, and ASIC design.
perial College London as a research associate. Currently
he is with Institute of Microelectronics at Tsinghua Uni-
versity as an Associate Professor. His research interests
include SoC design, reconfigurable computing and mobile
computing. Prof. Yin has published more than 40 refer-
eed papers, and served as TPC member or reviewers for
the international key conferences and leading journals.
