FP-BNN: Binarized neural network on FPGA

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Article history: Received 10 December 2016; Revised 10 August 2017; Accepted 17 September 2017; Available online xxx.
Communicated by Dr. Deng Cheng
Keywords: Binarized neural network; Hardware accelerator; FPGA

Abstract

Deep neural networks (DNNs) have attracted significant attention for their excellent accuracy, especially in areas such as computer vision and artificial intelligence. To enhance their performance, technologies for their hardware acceleration are being studied. FPGA technology is a promising choice for hardware acceleration, given its low power consumption and high flexibility, which make it particularly suitable for embedded systems. However, complex DNN models may need more computing and memory resources than those available in many current FPGAs. This paper presents FP-BNN, a binarized neural network (BNN) for FPGAs, which drastically cuts down the hardware consumption while maintaining acceptable accuracy. We introduce a Resource-Aware Model Analysis (RAMA) method, remove the multiplier bottleneck with bit-level XNOR and shifting operations, and remove the parameter-access bottleneck with data quantization and optimized on-chip storage. We evaluate the FP-BNN accelerator designs for an MNIST multi-layer perceptron (MLP), a Cifar-10 ConvNet, and AlexNet on a Stratix-V FPGA system. An inference performance of Tera operations per second with acceptable accuracy loss is obtained, which shows improvement in speed and energy efficiency over other computing platforms.

© 2017 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.neucom.2017.09.046
a promising method to compress NN models, which can directly shrink the bit-width of inputs and weights from 32 bits (single-precision floating-point) to a single bit. Recently, Courbariaux et al. [18] introduced a method to train binarized neural networks (BNNs) on the MNIST, Cifar-10 and SVHN [19] datasets with near state-of-the-art accuracy. Shortly after that, Rastegari et al. [20] announced that they had successfully trained ImageNet models with the BNN-based XNOR-Net method, reaching an accuracy 12.4% below the full-precision AlexNet while providing a 58-times speed-up and a 32-times model size compression. The emergence of binarized models makes it feasible to implement a system on FPGAs with much higher performance than floating-point versions. This motivates us to design a method that takes a given BNN model and generates the datapath logic and data management pattern on FPGA according to an optimization metric, forming an accelerator system targeting Tera-operations-per-second (TOP/s) throughput.

In this paper, we introduce FP-BNN, a BNN acceleration system design on FPGA, with related optimizations. The contributions of this paper are as follows:

- An analytical Resource-Aware Model Analysis (RAMA) to assess the resource cost, to help on-chip system architecture design.
- A datapath design with multipliers replaced by XNOR, popcount and shifting operations for BNNs, and a compression tree generation method for more efficient popcount.
- An optimized data managing pattern with parameter quantization and an on-chip storage strategy.
- A demonstration with popular small (MNIST MLP and Cifar-10 ConvNet) and large (AlexNet) models implemented on FPGA in binarized style, achieving a performance of TOP/s with high power efficiency.

The rest of the paper is organized as follows. Section 2 reviews the basic concepts of CNN and BNN and discusses related work. Section 3 describes the RAMA method. Section 4 presents the system design and the details of each processing element (PE). Section 5 explains how we tile and schedule the large computing task onto our system. Section 6 covers a data quantization to compress the model, and introduces the on-chip design of the memory system. The evaluation is discussed in Section 7, and the conclusion is given in Section 8.

2. Background

In this section, we will first provide an overview of the basic concepts of CNN, and then explain how a binarized NN works. Based on these concepts, we take a brief overview of related efforts and discuss them.

Fig. 1 shows a typical CNN model structure [22]. A CNN model usually consists of CONV layers, FC layers and Pooling (POOL) layers, forming a trainable network.

CONV layer: The CONV layer realizes a filter-like process, which uses a K × K weight kernel W to convolve the input fmaps; at each window position the kernel and the covered input window are combined by the dot product

$$X \odot Y = \sum_{i=1}^{K} \sum_{j=1}^{K} X(i,j) \cdot Y(i,j) \qquad (2)$$

FC layer: The FC layer operates a linear transformation on the input 1-D vectors with a weight matrix. The input-output connection pattern is fully connected, which is how the layer got its name. This process can be shown as:

$$A^{(l)}(n) = B^{(l)}(n) + \sum_{m=1}^{N_{in}^{(l)}} I^{(l)}(m) \cdot W^{(l)}(m,n) \qquad (3)$$

POOL layer: The POOL layer realizes a "down-sampling" operation, which compresses the input images into smaller scales. We take the most common max-POOL as an example, which extracts the maximum value from the K × K kernel window as the output:

$$A^{(l)}(i,j) = \max\left[ I^{(l)}_{K \times K}(i,j) \right] \qquad (4)$$

Activation layer: Just like biological neurons, we say a neuron is "firing" once its key value exceeds the threshold and "silent" if not. Various activation functions, such as ReLU, tanh and sigmoid, are implemented in neural network designs to imitate this neurological behaviour; they also introduce non-linearity into the networks.
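To make the three layer primitives above concrete, the following is a minimal NumPy sketch (our own illustration, not the paper's implementation) of the window dot product of Eq. (2), the fully-connected transform of Eq. (3) and the K × K max-pooling of Eq. (4); the function names and array shapes are our own assumptions.

```python
import numpy as np

def conv2d(fmap, kernel, stride=1):
    """Valid convolution of one input fmap with one K x K kernel (Eq. 2 per window)."""
    K = kernel.shape[0]
    H, W = fmap.shape
    out = np.empty(((H - K) // stride + 1, (W - K) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = fmap[i * stride:i * stride + K, j * stride:j * stride + K]
            out[i, j] = np.sum(window * kernel)          # window dot product of Eq. (2)
    return out

def fc(inputs, weights, bias):
    """Fully-connected transform of Eq. (3): A(n) = B(n) + sum_m I(m) * W(m, n)."""
    return bias + inputs @ weights

def max_pool(fmap, K):
    """K x K max-pooling of Eq. (4) with stride K."""
    H, W = fmap.shape
    return fmap[:H - H % K, :W - W % K].reshape(H // K, K, W // K, K).max(axis=(1, 3))
```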
Batch Normalization (BN) layer: Since the distribution of each layer's input can fluctuate during training, Batch Normalization [23] is introduced to speed up training. For a d-dimensional input vector x = (x⁽¹⁾, x⁽²⁾, ..., x⁽ᵈ⁾), we can normalize each dimension with:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}} \qquad (5)$$

After that, for each activation x⁽ᵏ⁾, we scale and shift the normalized value to achieve an identity transform:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)} \qquad (6)$$

where γ⁽ᵏ⁾ and β⁽ᵏ⁾ are to be learned during the training process. The whole process is described in Algorithm 1.

Algorithm 1 Batch Normalization [23].
1: Require: A mini-batch of input values: B = {xᵢ}, i = 1 ∼ m; initialized parameters: γ, β.
2: Ensure: Updated γ, β; output yᵢ = BN_{γ,β}(xᵢ), i = 1 ∼ m.
3: μ_B = (1/m) Σᵢ₌₁ᵐ xᵢ;  // get the mini-batch's mean
4: σ_B² = (1/m) Σᵢ₌₁ᵐ (xᵢ − μ_B)²;  // get the mini-batch's variance
5: x̂ᵢ = (xᵢ − μ_B) / √(σ_B² + ε);  // normalize
6: yᵢ ≡ BN_{γ,β}(xᵢ) = γ x̂ᵢ + β;  // scale and shift
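As a reference for the datapath discussion later, here is a minimal NumPy sketch of the inference-time form of Algorithm 1, in which the mini-batch statistics are replaced by stored running statistics; the function and variable names are our own.

```python
import numpy as np

def batch_norm_inference(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Inference-time batch normalization (Algorithm 1 with stored statistics)."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)   # normalize (Eq. 5)
    return gamma * x_hat + beta                               # scale and shift (Eq. 6)
```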
Table 2
Comparison between the activation functions Tanh, sign and HTanh (the original table also plots each function; only the definitions are reproduced here).

$$Tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

$$sign(x) = \begin{cases} +1 & x \ge 0 \\ -1 & x < 0 \end{cases}$$

$$HTanh(x) = \begin{cases} +1 & x > 1 \\ x & -1 \le x \le 1 \\ -1 & x < -1 \end{cases}$$
2.2. Training a CNN

A given CNN model with initialized parameters should be trained on a certain dataset in order to approximate the ideal model for ground-truth results. The most commonly used training method is Back-Propagation (BP) training, which consists of two stages:

(1) Forward propagation (inference), which leads the input data through the network to get an output result;
(2) Back propagation, which calculates the error between the output and the ground-truth labels with a defined loss function C, and then propagates the gradient of each layer's output function backwards to update the weights, in order to minimize the loss function for the next training iteration.

A detailed derivation can be found in [24]. Since the overall process is compute-intensive, high-performance servers with accelerators such as GPUs are often used in training. The pretrained models can then be used in many real-time scenarios by going through the inference process only, with minor changes, which can be implemented on many embedded hardware platforms.

2.3. How BNN works

The essential idea of BNN is to constrain both weights and activations to +1 and −1 [18]. The binarization can be done in either a stochastic or a deterministic way, and the latter is often realized by the Sign function:

$$x^{b} = Sign(x) = \begin{cases} +1 & x \ge 0 \\ -1 & x < 0 \end{cases} \qquad (7)$$

The problem is that during the training process the derivative of the Sign function is almost zero everywhere (as shown in Table 2), resulting in an incompatibility with the BP training process. Hinton [25] introduced a "straight-through estimator" to cope with this problem. Courbariaux et al. [18] used a similar estimator in a deterministic way, which can be seen as a hard tanh (HTanh) function:

$$est(Sign(x)) = HTanh(x) = \begin{cases} +1 & x > 1 \\ x & -1 \le x \le 1 \\ -1 & x < -1 \end{cases} \qquad (8)$$

Assume the required gradient is dC/du, and a = Sign(u); then we have the estimator of the gradient:

$$est\left(\frac{dC}{du}\right) = \frac{dC}{da} \cdot \frac{dSign(u)}{du} = \begin{cases} \frac{dC}{da} & -1 \le u \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

Since BN layers have the effect of avoiding internal covariate shift, which can accelerate the training process and reduce the impact of binarization, [18] introduces BN layers in their BNN models. To deal with the large amount of multiplications in BN, they replace them with shift operations to get a Shift-Based BN (SBN). This largely reduces the computing resource cost with only a small loss of precision, which can actually be healed through the training process. The SBN replacement can be described as Eq. (10), where sal(x, y) means an arithmetic left shift of x by y bits:

$$x \cdot y \approx sal\left[x,\ round(\log_2 |y|)\right] \cdot sign(y) \qquad (10)$$
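The following NumPy sketch (our own illustration, with our own function names) shows the three ingredients above: deterministic binarization (Eq. 7), the straight-through gradient estimator (Eq. 9), and the shift-based replacement of a multiplication (Eq. 10).

```python
import numpy as np

def binarize(x):
    """Deterministic binarization of Eq. (7): +1 for x >= 0, -1 otherwise."""
    return np.where(x >= 0, 1.0, -1.0)

def ste_grad(grad_output, u):
    """Straight-through estimator of Eq. (9): pass the gradient only where |u| <= 1."""
    return grad_output * (np.abs(u) <= 1.0)

def shift_mul(x, y):
    """Shift-based approximation of x * y (Eq. 10): replace |y| by the nearest power of two."""
    shift = np.round(np.log2(np.abs(y)))          # sal(x, shift) corresponds to x * 2**shift
    return x * np.exp2(shift) * np.sign(y)

# Example: 5.0 * 3.0 = 15.0 is approximated as 5.0 * 4.0 = 20.0 (|3| rounds to 2**2).
approx = shift_mul(np.array(5.0), np.array(3.0))
```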
2.4. Related work

To accelerate an NN model in embedded hardware, resources must be husbanded carefully. There have been many efforts deploying CNN models in hardware. Farabet et al. [26] designed a simple face detection system with 3 CONV layers and 5 FC layers on FPGA, with a performance of 10 frames (512 × 384) per second. Zhang et al. [14] proposed a nested-loop model to describe CNNs, and accelerate CONV layers only, under the guidance of a roofline model. Qiu et al. [15] realized an even deeper VGG model on FPGA. Most of these previous designs store weights and fmaps off-chip since their size is too large for on-chip storage. As a result, the dataflow bandwidth is limited and frequent off-chip memory accesses happen. So some
designs support a dedicated memory cache for on-chip data reuse [27–29], but the increase in memory placement means fewer arithmetic resources, since chip area is limited.

Clearly a small model that supports high accuracy and high performance is ideal. One method is to exploit the sparsity inside the model by pruning off connections [16,30,31]. Another method is to reduce the bit-width of operations. Much previous work took a quantization approach.

Table 3
Resource cost of MACs on a Stratix V FPGA.

Operation               LUT    FF     DSP
32-bit float add (+)    581    525    0
x-bit fixed add (+)     x      x+1    0
32-bit float mult (×)   147    363    1
x-bit fixed mult (×)    0      1      ⌈x/18⌉
Table 4
RAMA-based topology analysis of MNIST MLP, Cifar-10 ConvNet and AlexNet.
Columns: Macro layer | Structure | N_in × R_in² | K | S | K_POOL | N_out × R_out² | N(W) | N(Others) | N(MAC) | N(A)
4.1. Overall architecture

A normal structure of a BNN model is given in Fig. 3. We can divide the model into several macro-layers with similar structures, each including a convolution or fully-connected (C/F) layer, a batch normalization (BN) layer and an activation layer which consists of a Hard Tanh (HTanh) layer and a Binarized Neuron (BNeu) layer. For some macro layers, pooling is introduced for down-sampling. Here we choose the MNIST MLP and the Cifar-10 ConvNet as small-dataset examples, and AlexNet for the large dataset ImageNet. The topology of each model is described in Table 4, and the key features of each layer are extracted based on RAMA, in which R_in and R_out are respectively the input and output image size, K is the convolution kernel (window) size, S is the stride of the moving window, and K_POOL is the pooling window size.

The overall system is shown in Fig. 4. We have altogether N_PE channels to process in parallel the data from the input cache. The CONV/FC (C/F) layer includes processing elements (PEs) that are shared by the CONV and FC layers, since both mainly consist of MAC computations. The Shift-Based Normalization (SBN) layer adopts shift operations to replace multiplications, as mentioned in Section 2.3. The activation layer merges the HTanh and BNeu layers together to produce an output vector containing either 0 or 1. Parameters for each layer are fetched from on-chip BRAMs or registers to meet bandwidth requirements, and control signals select them for each iteration. The output of each iteration is transferred to the intermediate result cache. For each following layer, the interconnection is reconfigured by the controller according to the type (CONV or FC) of the layer.

Next, we take a look at the details of the different types of PE design.

4.2. C/F PE

4.2.1. XNOR-based binary MAC

Normally, it is necessary to utilize DSPs or customized LUT-based logic to complete a MAC operation for both floating-point and fixed-point input values. However, if the input values become binary, the situation is much different.

Consider two input vectors A = {aᵢ} and B = {bᵢ} (i = 1 to N_in) which consist of binarized values, either +1 or −1; then the product of the corresponding elements in the two vectors will also be either +1 or −1. The sign of the product depends on the two input elements' signs: if they are identical, the product is positive, otherwise it is negative. We then need to accumulate these binary values to get a final result. This process is depicted in Fig. 5(a).

Hardware implementations usually take 2 bits to represent +1 and −1. If we use only one bit, we should take 0 and 1 as the basic values. This can be achieved through an affine transformation, since we have

$$A_{0,1} = \frac{A_{-1,1} + A_{1}}{2} \qquad (15)$$

in which A₁ represents the all-1 vector of the same length as A₋₁,₁.
Table 5
Truth table of the affine-transformed inputs and result.

a₋₁,₁   b₋₁,₁   a·b    a₀,₁   b₀,₁   XNOR(a₀,₁, b₀,₁)
 1       1       1      1      1      1
 1      −1      −1      1      0      0
−1       1      −1      0      1      0
−1      −1       1      0      0      1

To keep the truth table of the result as shown in Table 5, we can infer that the operation should be transformed from multiplication to XNOR. For the dot product of the two ±1 vectors we can then derive:

$$\mathrm{result} = A_{-1,1} \cdot B_{-1,1} = \sum_{i=1}^{vec\_len} a_i^{-1,1}\, b_i^{-1,1} = \sum_{i=1}^{vec\_len} \left[\mathrm{XNOR}(a_i^{0,1}, b_i^{0,1}) - \mathrm{XOR}(a_i^{0,1}, b_i^{0,1})\right] = 2\,\mathrm{popcount}(R_{0,1}) - vec\_len \qquad (16)$$

in which

$$R_{0,1} = \{\mathrm{XNOR}(a_i^{0,1}, b_i^{0,1}),\ i = 1 \text{ to } vec\_len\} \qquad (17)$$

If one of the inputs is already (0,1)-based, for example in the first layer, then we get the result with:

$$\mathrm{result} = A_{0,1} \cdot B_{-1,1} = \frac{A_{-1,1} + A_{1}}{2} \cdot B_{-1,1} = \frac{2\,\mathrm{popcount}(R_{0,1}) - vec\_len}{2} + \frac{\sum_i b_i^{-1,1}}{2} = \mathrm{popcount}(R_{0,1}) - vec\_len + \sum_i b_i^{0,1} \qquad (18)$$

Fig. 5. Conversion from (a) the {−1, 1}-based MAC to (b) {0, 1}-based XNOR and popcount operations.
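As a quick sanity check of Eqs. (16) and (17), the following NumPy sketch (our own illustration) computes a binarized dot product both directly on ±1 values and via the XNOR/popcount form on 0/1 values.

```python
import numpy as np

rng = np.random.default_rng(0)
vec_len = 64
a_pm = np.where(rng.random(vec_len) < 0.5, 1, -1)     # A in {-1, +1}
b_pm = np.where(rng.random(vec_len) < 0.5, 1, -1)     # B in {-1, +1}

a01 = (a_pm + 1) // 2                                  # affine transform of Eq. (15)
b01 = (b_pm + 1) // 2

direct = int(np.dot(a_pm, b_pm))                       # reference +/-1 MAC

xnor = 1 - (a01 ^ b01)                                 # XNOR on {0, 1} values, Eq. (17)
popcount = int(xnor.sum())
via_popcount = 2 * popcount - vec_len                  # Eq. (16)

assert direct == via_popcount
```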
Fig. 6. The popcount compressor tree based on 6:3 compressors and one ternary adder.
A control signal selects the operation applied to the output of the popcount compressor tree.

4.2.2. Popcount Compressor (PC) tree

The popcount value, also known as the Hamming weight, can easily be calculated in parallel hardware. However, for long vectors this process can be demanding both in time and in resource usage. The most common way is to use a binary full adder tree to sum up the bits in the vector, which results in a delay of log₂(vec_len) stages and n − 1 adders of different bitwidths. Here we present a compressor tree method inspired by [42].

The popcount process can be seen as compressing N input bits into log₂N + 1 weighted result bits. Since most modern FPGA architectures have 6-input LUTs, a 6:3 compressor (which can be seen as the case N = 6) is an efficient basic component, for it can calculate the popcount of a 6-bit input vector in a look-up table, which leads to a 3-bit popcount output with only three 6-input LUTs in parallel.

A tuple T = (p_k; q_{k+m−1}, ..., q_{k+1}, q_k) represents a p_k:m compressor, where the subscript j = k, k+1, k+2, ... stands for the bit weight 2ʲ and p_j, q_j stand for the number of input and output bits of that weight, respectively. In this way, a 6:3 compressor can be represented as (6; 1, 1, 1).

As shown in Fig. 6, the input vector is divided into 6-bit portions, each connected to the input ports of a 6:3 compressor. Empty input bits are filled with dummy 0's (shown as hollow dots in Fig. 6). For the following stages, the bits with the same weight (we call them a column vector) repeat the same process, forming a compressor tree. The output bits heap up according to their weights. Our target is to reduce the height of the heap to 3, which can then be accepted as the three input vectors of a ternary adder to get the final sum. Thus, column vectors with a height of less than 4 stop being compressed in the next stage, and the compression process terminates when all column vectors' heights are at most 3. The generation procedure is given in Algorithm 2.

Algorithm 2 Popcount compressor tree generation algorithm.
1: Require: Input vector of height N.
2: Ensure: Updated column vector heights h(i, j), where i stands for the bit weight 2ⁱ and j for the compression stage; heap of stage j: H(j) = {h(k, j)}, k = 0, 1, ..., log₂(N).
3: h(0, 0) = N, i = 0, j = 0;
4: while max(H(j)) > 3 do
5:   H(j+1) = zeros(1, log₂(N));
6:   for k = 0 to log₂(N) do
7:     if h(k, j) > 3 then
8:       n_compressor(k, j) = ⌈h(k, j)/6⌉;
9:       h(k, j+1) = h(k, j+1) + n_compressor(k, j);
10:      h(k+1, j+1) = h(k+1, j+1) + n_compressor(k, j);
11:      h(k+2, j+1) = h(k+2, j+1) + n_compressor(k, j);
12:    else
13:      h(k, j+1) = h(k, j+1) + h(k, j);
14:    end if
15:  end for
16:  j = j + 1;
17: end while

Table 6
Comparison between an accumulation adder tree and the popcount compressor tree.

BW_in (bits)       BW_out (bits)   LUTs (Acc.)   LUTs (Pop.)   Saved (%)
9 (3²)             4               9             10            −11.1
16                 5               21            19            9.52
64                 7               98            79            19.39
256                9               398           291           26.88
1024               11              1596          1106          30.70
1152 (128 × 3²)    11              1796          1228          31.63
1200 (48 × 5²)     11              1864          1282          31.22
8192               14              12768         8362          34.51
Implementing Eq. (21) requires reuse of the PE. Hence, we introduce an accumulator to the PE with a selectable left-shifter. While the preceding lower-bit vector is being processed, the next input vector can be loaded behind it and added to the shifted result of the preceding vector. The start and the end of the accumulation are set by the controller signal. A detailed scheduling will be introduced in Section 5.

The overall structure of the C/F PE is given in Fig. 7. For AlexNet, the binarization method is different from the sign function [20]. It introduces a binarized filter w_bin for w with a scaling factor α, in order to approximate the MAC operation by x · w ≈ α (x · w_bin), where α = wᵀw_bin / n = (1/n)‖w‖_ℓ1. So for AlexNet the accumulator is followed by a multiplier to apply the scaling factor α.
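A brief NumPy sketch of this XNOR-Net-style weight binarization (our own illustration; the names are our own) computes w_bin and α and compares the scaled binary dot product with the full-precision one.

```python
import numpy as np

def binarize_weights_xnornet(w):
    """XNOR-Net style binarization: w ~ alpha * w_bin with alpha = ||w||_1 / n."""
    w_bin = np.where(w >= 0, 1.0, -1.0)
    alpha = np.abs(w).mean()            # equals (w.T @ w_bin) / n
    return w_bin, alpha

rng = np.random.default_rng(1)
w = rng.normal(size=256)
x = rng.normal(size=256)

w_bin, alpha = binarize_weights_xnornet(w)
exact = float(x @ w)                    # full-precision MAC
approx = alpha * float(x @ w_bin)       # scaled binary MAC: x . w ~ alpha * (x . w_bin)
```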
4.3. BN PE

As described in Algorithm 1 [23], the batch normalization process can be presented as:

$$y = \frac{x - \mu}{\sigma} \cdot \gamma + \beta, \qquad (22)$$

where μ stands for the running mean value and σ stands for the standard deviation. γ and β are the learnt values implementing the affine scale and shift for an identity transform. However, as mentioned in Section 2.3, floating-point multiplications are required for every normalization, which leads to a considerable resource cost. For this reason, [18] uses shifting to approximate the multiplication. For Eq. (22), the shift-based approximation is:

$$y = sal\left[(x - \mu),\ \phi\right] \cdot sign\left(\frac{\gamma}{\sigma}\right) + \beta, \qquad (23)$$

where φ = round(log₂|γ/σ|) is the left-shift amount derived from σ and γ.

As we have the pre-trained models in hand, we can calculate the required parameters for BN, such as γ/σ, in advance, and store them into the corresponding parameter cache. With the above, we get the SBN PE as presented in Fig. 8. We have also noticed that the shift-based approximation would cause a severe accuracy drop for AlexNet (from 42.9% to 31.9%), so we avoid using the shift replacement for AlexNet and keep the original batch normalization, using a multiplier to replace the shift and sign operations.

For all models, we train them with the last BN layer kept as the original batch normalization in order to avoid accuracy loss, that is, with no shift-based operations. These floating-point multiplications are implemented independently with DSP blocks, and for ImageNet classification (AlexNet) the output process is tiled.
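The following NumPy sketch (our own illustration) contrasts exact inference-time BN (Eq. 22) with the shift-based SBN approximation (Eq. 23), using a precomputed ratio γ/σ as described above.

```python
import numpy as np

def bn_exact(x, mu, gamma_over_sigma, beta):
    """Exact inference BN of Eq. (22), with gamma/sigma precomputed offline."""
    return (x - mu) * gamma_over_sigma + beta

def bn_shift_based(x, mu, gamma_over_sigma, beta):
    """Shift-based approximation of Eq. (23): the multiply becomes a power-of-two shift."""
    phi = np.round(np.log2(np.abs(gamma_over_sigma)))      # left-shift amount
    return (x - mu) * np.exp2(phi) * np.sign(gamma_over_sigma) + beta

x = np.linspace(-4.0, 4.0, 9)
y_exact = bn_exact(x, mu=0.5, gamma_over_sigma=1.3, beta=-0.2)
y_approx = bn_shift_based(x, mu=0.5, gamma_over_sigma=1.3, beta=-0.2)   # phi = 0 here
```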
4.4. Activation PE

For the last part of a normal macro layer, we need to binarize the result into either 0 or 1. In the training process, the HTanh function, as shown in Table 2, constrains the values between −1 and 1, while the final BNeu layer pushes those values in between to the two boundaries, which means −1 for all the negatives and 1 for the others. Since, as introduced in Section 4.2, the (−1,1)-based vectors can be affine-transformed into (0,1) vectors, the case becomes much simpler: we just need to discern all the negative values from the SBN layer and set them to 1, with the others set to 0. This can be done directly by accessing the sign bit of these values.

4.5. Pooling

For some macro-layers, pooling is applied to support sub-sampling in order to reduce the output fmap size. As shown in Table 4, pooling comes closely after the CONV layer, and the pooling type for all of our models is max-pooling. If pooling came after activation, most of the output values would be +1, which results in significant information loss during training [20]. So a C-P-B-A macro-layer structure is taken for training. However, for the inference process, a C-B-A-P structure gives an identical result, and the pooling is then applied to values of 0 and 1 only. This can be directly implemented with OR operations: we organize selective line buffers after the activation PEs, and when a pooling process is required, the activation values stream into the line buffers. If the pooling size is K, we enable OR operations on the horizontally targeted locations once K rows of activations arrive, and then a similar process is repeated along the vertical direction with other line buffers to complete a 2D max-pooling.
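Since max-pooling over values that are only 0 or 1 reduces to a logical OR over the window, a compact NumPy sketch (our own illustration) of the 2-D OR-based pooling described above is:

```python
import numpy as np

def or_max_pool(binary_fmap, k):
    """2-D max-pooling of a {0,1} feature map with a k x k window and stride k,
    implemented as a horizontal OR followed by a vertical OR (as in the line buffers)."""
    h, w = binary_fmap.shape
    fmap = binary_fmap[:h - h % k, :w - w % k].astype(bool)
    horiz = fmap.reshape(fmap.shape[0], -1, k).any(axis=2)      # OR along each row window
    vert = horiz.reshape(-1, k, horiz.shape[1]).any(axis=1)     # OR along each column window
    return vert.astype(np.uint8)

pooled = or_max_pool(np.random.default_rng(2).integers(0, 2, size=(8, 8)), k=2)
```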
Table 7
Tiling strategy for different models.
Columns: layers 1–9 of each model.

For the first layer, the input length is computed as

$$L_{in} = \sum_{i=1}^{t} N_{in} \cdot K^{2} \cdot 2^{i-1} \qquad (24)$$

and through this tiling, the utilization rate of the PE for the first layer is increased.

With RAMA and the information given in Tables 4 and 6, our tiling strategy is shown in Table 7.
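To illustrate the multi-bit handling that Fig. 10 and Eq. (24) refer to, the sketch below (our own illustration, not the paper's scheduler) decomposes a t-bit unsigned input into bit planes, processes each plane as a binary vector against ±1 weights, and recombines the partial results with left shifts, which corresponds to the accumulator with the selectable left-shifter described in Section 4.

```python
import numpy as np

def bitplane_mac(x_fixed, w_pm, t):
    """Dot product of t-bit unsigned inputs with {-1,+1} weights, one bit plane at a time."""
    acc = 0
    for i in range(t):                          # plane i carries weight 2**i
        plane = (x_fixed >> i) & 1              # binary vector for this bit plane
        partial = int(np.dot(plane, w_pm))      # binary MAC (XNOR/popcount in hardware)
        acc += partial << i                     # shift-and-accumulate across planes
    return acc

rng = np.random.default_rng(3)
x = rng.integers(0, 256, size=32)               # 8-bit inputs, e.g. first-layer pixels
w = np.where(rng.random(32) < 0.5, 1, -1)
assert bitplane_mac(x, w, t=8) == int(np.dot(x, w))
```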
Fig. 10. Example of tiling for multiple bit case (2-bit input and 1-bit weight).
Fig. 11. The model accuracy variation as a function of bitwidth of BN’s mean μ ((a)MNIST; (c)Cifar-10) and affine bias β ((b)MNIST; (d)Cifar-10), respectively.
For the biases of the C/F layers, we discover that even if their bitwidth drops to 0, there is still no significant variation in the results. So, to save storage and computing resources, we ignore the C/F layer bias addition. For the running means μ and the affine biases β, the fixed-point format varies between 2 and 8 bits for different layers. For the shift parameter of the BN layers, we can express each φ as φ_min + Δφ, and therefore we only need to store the variation Δφ to reduce resource usage. So the bitwidth w of BW can be calculated as:

$$w = \begin{cases} \lfloor \log_2(\phi_{max} - \phi_{min}) \rfloor + 1, & \phi_{max} > \phi_{min} \\ 0, & \phi_{max} = \phi_{min} \end{cases} \qquad (27)$$

We choose a dynamic fixed-point strategy [44] to optimize the bitwidth for each layer. Table 8 shows all the value ranges and BW strategies that we have chosen for each parameter, and Table 9 shows the comparison between the quantized models and the original models.
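As an illustration of this dynamic fixed-point idea (our own sketch; the paper's exact rounding policy is not reproduced), the helper below picks a per-layer integer/fraction split from a parameter's observed range and quantizes values to that grid, in the spirit of the (integer, fraction) bitwidth pairs listed in Table 8.

```python
import numpy as np

def dynamic_fixed_point(values, total_bits):
    """Pick integer/fraction bits from the observed range, then round to that grid."""
    max_abs = float(np.max(np.abs(values)))
    int_bits = int(np.ceil(np.log2(max_abs))) + 1 if max_abs >= 1.0 else 1   # includes sign
    frac_bits = total_bits - int_bits
    step = 2.0 ** (-frac_bits)
    q = np.clip(np.round(values / step) * step,
                -(2.0 ** (int_bits - 1)), 2.0 ** (int_bits - 1) - step)
    return q, (int_bits, frac_bits)

# Example: values with a range similar to the MNIST beta row of Table 8.
beta = np.array([-3.0121, 1.25, 3.0051])
q_beta, fmt = dynamic_fixed_point(beta, total_bits=6)   # fmt == (3, 3) here
```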
6.2. Memories for parameters

The memory storage structure for weights is given in Fig. 12. In order to reduce the memory access time, we keep the parallelism of the memories identical to the number of PE channels N_PE, and the width of each memory equal to PE_size. Weights for each computing tile are arranged serially. An address generator is controlled by the overall controller in order to provide the exact weight. For the MNIST MLP and the Cifar-10 ConvNet, the weights can fit into an array of BRAMs. For AlexNet, the weights are tiled by each layer, or inside a layer once they get too large. The oldest weights for finished tiles are overwritten by the weights for the next tile from off-chip memory in a ping-pong fashion. For the other parameters, a similar storage structure is proposed based on MLABs.

6.3. Intermediate cache

For the intermediate outputs, that is, the output fmaps of each macro layer, we place a cache to hold them for the next macro layer to read. The design of the intermediate cache is given in Fig. 13. For CONV layers, the cache structure facilitates the sliding-window data fetching. For each input iteration, we separate the intermediate cache into T_Nin groups, each containing K memory blocks, with every two adjacent blocks storing consecutive rows. The input logic needs to offer the corresponding addresses for the required K rows, and select rows to choose the required horizontal K bits. So in total we get T_Nin × K² windows to form an L_in-long tensor. The output fmaps of each layer are stored into the space spared from the input fmaps in a ping-pong way. For FC layers, we just need to ensure that the cache bitwidth can satisfy T_Nin. For the MNIST MLP we choose 32 MLABs of size 32 × 32 for the intermediate cache, and for the Cifar-10 ConvNet and AlexNet we take 384 MLABs of size 32 × 32.
Table 8
Quantization strategies for parameters other than the C/F weights (MNIST & Cifar-10). Each parameter is listed with its minimum, maximum and bitwidth BW given as (integer bits, fraction bits); the four parameter groups are the C/F bias, the BN running mean μ, the BN shift φ and the BN affine bias β.

Model | Layer | C/F bias: min, max, BW | μ: min, max, BW | φ: min, max, BW | β: min, max, BW
MNIST | 1 | −24.3146, 22.2333, (0,0) | −139.1505, 148.0735, (6,3) | 1, 10, (4,0) | −3.0121, 3.0051, (4,−1)
MNIST | 2 | −29.2907, 34.2132, (0,0) | −156.4832, 166.0365, (6,3) | 0, 10, (4,0) | −3.1457, 3.1249, (4,−1)
MNIST | 3 | −26.6156, 35.4796, (0,0) | −135.1257, 139.6590, (7,2) | 2, 11, (4,0) | −2.6029, 2.6408, (2,1)
MNIST | 4 | −0.1449, 0.0467, (0,0) | −44.5449, 39.5620, (2,5) | — | —
Cifar-10 | 1 | −0.9777, 0.9976, (0,0) | −0.9788, 0.9983, (4,−3) | 1, 2, (1,0) | −1, 0.9998, (8,−6)
Cifar-10 | 2 | −0.9990, 0.9998, (0,0) | −42.0525, 112.1753, (6,2) | 6, 7, (1,0) | −0.7739, 0.5044, (5,−4)
Cifar-10 | 3 | −0.9987, 0.9998, (0,0) | −97.7165, 76.5325, (5,3) | 6, 6, (0,0) | −0.9993, 0.9991, (7,−6)
Cifar-10 | 4 | −1, 0.9923, (0,0) | −78.6930, 190.5350, (6,3) | 5, 7, (2,0) | −0.9661, 0.6624, (6,−5)
Cifar-10 | 5 | −0.9968, 0.9964, (0,0) | −155.4622, 172.6664, (7,2) | 6, 7, (1,0) | −1, 1, (5,−3)
Cifar-10 | 6 | −0.9862, 1, (0,0) | −29.3337, 270.3043, (6,4) | 5, 8, (2,0) | −0.9983, 0.9684, (5,−4)
Cifar-10 | 7 | −1, 1, (0,0) | −798.1793, 761.8890, (7,4) | 4, 8, (3,0) | −1, 1, (5,−3)
Cifar-10 | 8 | −1, 1, (0,0) | −131.6111, 144.3177, (4,5) | −6, 7, (4,0) | −1, 1, (4,−2)
Cifar-10 | 9 | −0.2266, 0.1051, (0,0) | 108.3246, 203.2282, (4,5) | — | —
Table 9
Model classification accuracy and size comparisons among original, binarized and quantized ones.
7. Evaluation

We evaluate the performance of our accelerator system in this section. The environment setup and NN model preparation are introduced first. We train the target CNN models in binarized versions and apply quantization to the parameters to further compress the models. Then we map the optimized BNN models onto the FPGA, and provide a performance analysis comparing FP-BNN with general-purpose processors and other FPGA designs.

We train the models on an IBM x3650 M4 server equipped with an NVIDIA Tesla K40 (28 nm feature size, 2880 CUDA cores with 12 GB GDDR5 external memory) card and a K80 GPU (28 nm feature size, 4992 CUDA cores with 24 GB GDDR5 external memory) card, and use both cards to accelerate the training process. The evaluation system is built on Maxeler's MPC-X2000 platform [45]. The system has 8 dataflow engines (DFEs), each comprising a single Altera Stratix-V 5SGSD8 FPGA (28 nm feature size).
We use the Torch 7 framework [46] to train the NN models: for MNIST and Cifar-10 based on Hubara's BNN framework [18], and for AlexNet based on Rastegari's XNOR-Net framework [20]. The MNIST dataset is a permutation-invariant version consisting of 60 K examples of 28 × 28 grey-level digit images for training and 10 K examples for testing. The Cifar-10 dataset consists of 50 K examples of 32 × 32 RGB colour images in 10 classes for training and 10 K examples for testing; global contrast normalization and ZCA whitening are used in the same way as Goodfellow et al. [47] and Lin et al. [48] did. The ImageNet dataset consists of 1.2 M images from 1 K categories for training and 50 K images for testing, and a center crop of 224 × 224 is extracted for forward propagation. The Adam [49] learning rule is adopted for training, with mini-batch sizes of 100, 200 and 800. The binarization method used here is deterministic [50], considering the convenience of implementing the inference in hardware. Model accuracies are measured and presented in Table 9. As we can see, even the quantized versions for MNIST and Cifar-10 keep high accuracies close to the state-of-the-art results. The XNOR-Net-based AlexNet for our design suffers from a 13% accuracy drop, while supporting state-of-the-art performance among the existing BNN solutions for ImageNet.

7.3. Hardware implementation

We use MaxCompiler to generate the executable bit-stream for the FPGA, which takes Altera Quartus II v13.1 to synthesize, place and route the designs. The resource utilization of the final implementation is shown in Table 10. The design can be driven with an achievable 150 MHz clock. We notice that the utilization of DSP blocks is not high, since only a small portion of the arithmetic operations needs floating-point multipliers.

We implement binarized models for the Xeon E5-2640 CPU, the NVIDIA Tesla K40 GPU and the Altera Stratix-V FPGA. Performance is measured as shown in Table 11. To feed the CPU and GPU with enough data, we take the batch size for forward propagation to be identical to that used in training. As we can see, with about an order of magnitude slower clock frequency and much lower power consumption, our accelerator still gets an average speed-up of 314.07 times over the CPU and 19.08 times over the GPU for MNIST, 51.83 times over the CPU and 5.07 times over the GPU for Cifar-10, and 11.67 times over the CPU and 2.72 times over the GPU for ImageNet. Peak speed-ups can reach 705.19 times over the CPU and 70.75 times over the GPU. Although the model has been compressed by about 32 times, the low-precision operations can exploit the potential of fine-grained parallelism in the FPGA, which can offer higher performance than CPUs and GPUs. If we take energy efficiency as the criterion, with a similar feature size, the FPGA implementation offers an efficiency two to three orders of magnitude higher than the CPU's and the GPU's.

We make another comparison with some previous FPGA accelerator designs for CNN and BNN models, as listed in Table 12. We can see that our FP-BNN reaches a TOP/s speed which is significantly faster than the previous CNN designs. For some designs (such as that in [15]), one major problem is that for memory-centric FC layers, the data and parameter loading time is much longer than the computing time, as the number of input ports for the data and weight RAMs is limited to 8, while in our design all the computing channels can be fed with data and weights in parallel. FP-BNN is also much faster than the most recent BNN design [40]. Although our design involves a large FPGA, the power efficiency is also 10 times better. Another BNN design, FINN [41], reaches a performance similar to ours. For the MNIST case, FINN has taken a smaller MLP network in which the input dimension is larger than the number of neurons in each layer, which results in a higher resource utilization after task tiling.
Table 11
Performance analysis among the Xeon E5-2640 CPU, NVIDIA Tesla K40 GPU and Maxeler MPC-X2000 (Stratix V) FPGA systems. Core clock: CPU 2.5 GHz (base) / 3 GHz (boost); GPU 745 MHz (base) / 810 & 875 MHz (boost); FPGA 150 MHz.

Model | Macro layer | Ops | CPU: time (ms) / perf (GOP/s) | GPU: time (ms) / perf (GOP/s) | FPGA: time (ms) / perf (GOP/s) | Speedup vs CPU | Speedup vs GPU
MNIST | 1 (FC) | 3.21 M | 17.54 / 18.32 | 1.08 / 298.63 | 1.97×10⁻³ / 1633.89 | 89.19× | 5.47×
MNIST | 2 (FC) | 8.39 M | 44.28 / 18.95 | 2.57 / 326.61 | 6.87×10⁻⁴ / 12219.40 | 644.91× | 37.41×
MNIST | 3 (FC) | 8.39 M | 43.98 / 19.08 | 2.57 / 326.61 | 6.87×10⁻⁴ / 12219.40 | 640.51× | 37.41×
MNIST | 4 (FC) | 40.96 K | 0.77 / 5.33 | 0.26 / 15.76 | 5.33×10⁻⁵ / 768.19 | 144.17× | 48.75×
MNIST | Total | 20.04 M | 106.57 / 18.80 | 6.47 / 309.48 | 3.39×10⁻³ / 5904.40 | 314.07× | 19.08×
Cifar-10 | 1 (CONV) | 7.08 M | 19.70 / 71.85 | 4.38 / 323.20 | 4.10×10⁻² / 172.61 | 2.40× | 0.53×
Cifar-10 | 2 (CONV) | 301.99 M | 355.74 / 169.78 | 31.0 / 1947.69 | 2.74×10⁻² / 11040.33 | 65.03× | 5.67×
Cifar-10 | 3 (CONV) | 151.00 M | 140.85 / 214.40 | 14.7 / 2059.96 | 1.37×10⁻² / 11021.55 | 51.41× | 5.35×
Cifar-10 | 4 (CONV) | 301.99 M | 283.23 / 213.25 | 27.8 / 2170.09 | 2.05×10⁻² / 14712.09 | 68.99× | 6.78×
Cifar-10 | 5 (CONV) | 151.00 M | 146.28 / 206.44 | 16.2 / 1858.86 | 1.03×10⁻² / 14678.75 | 71.10× | 7.90×
Cifar-10 | 6 (CONV) | 301.99 M | 297.53 / 203.00 | 30.7 / 1965.06 | 1.71×10⁻² / 17646.50 | 86.93× | 8.98×
Cifar-10 | 7 (FC) | 16.78 M | 104.95 / 31.97 | 7.00 / 479.38 | 1.01×10⁻³ / 16667.13 | 521.27× | 34.77×
Cifar-10 | 8 (FC) | 2.10 M | 12.35 / 33.98 | 0.92 / 455.14 | 2.60×10⁻⁴ / 8069.91 | 237.48× | 17.73×
Cifar-10 | 9 (FC) | 20.48 K | 0.64 / 6.45 | 0.33 / 12.27 | 6.67×10⁻⁵ / 307.35 | 47.62× | 25.05×
Cifar-10 | Total | 1.23 G | 1361.28 / 181.29 | 133 / 1853.87 | 1.3×10⁻¹ / 9396.41 | 51.83× | 5.07×
AlexNet | 1 (CONV)ᵃ | 211.83 M | 790.69 / 213.31 | 345.68 / 487.91 | 9.98×10⁻¹ / 211.19 | 0.99× | 0.43×
AlexNet | 2 (CONV) | 895.80 M | 2125.08 / 337.23 | 186.57 / 3841.23 | 5.84×10⁻² / 15347.72 | 45.51× | 4.00×
AlexNet | 3 (CONV) | 299.04 M | 1715.52 / 139.45 | 127.28 / 1879.54 | 2.03×10⁻² / 14711.75 | 105.50× | 7.83×
AlexNet | 4 (CONV) | 448.56 M | 1460.71 / 245.67 | 165.95 / 2162.39 | 2.71×10⁻² / 16560.22 | 67.41× | 7.66×
AlexNet | 5 (CONV) | 299.04 M | 1346.86 / 177.62 | 108.86 / 2197.61 | 1.81×10⁻² / 16545.97 | 93.15× | 7.53×
AlexNet | 6 (FC) | 75.50 M | 1640.79 / 36.81 | 168.97 / 357.44 | 4.31×10⁻³ / 17503.28 | 475.50× | 48.97×
AlexNet | 7 (FC) | 33.55 M | 1229.86 / 21.83 | 123.40 / 217.54 | 2.18×10⁻³ / 15391.94 | 705.19× | 70.75×
AlexNet | 8 (FC)ᵃ | 8.19 M | 499.78 / 13.66 | 30.00 / 218.40 | 2.74×10⁻² / 298.47 | 21.85× | 1.37×
AlexNet | Total | 2.27 G | 10789.29 / 168.35 | 1256.72 / 722.68 | 1.16 / 1963.96 | 11.67× | 2.72×

ᵃ The weights of these layers are quantized to 8-bit.
Table 12
Performance comparison with former FPGA-based CNN accelerator designs (O = overall, P = peak, C = CONV, F = FC).

FPGA'16 [51]: precision 8–16 bit; model size 30.9 GOPs; performance 117.8 GOP/s; efficiency 4.57 GOP/s/W.
FPGA'16 [15]: precision 16 bit; model size 30.76 GOPs; performance 136.97 (O) / 187.80 (C) / 1.20 (F) GOP/s; efficiency 14.22 GOP/s/W.
FPL'16 [32]: precision 16 bit; model size 1.45 GOPs; performance 565.94 GOP/s; efficiency 22.15 GOP/s/W.
FPGA'17 [40]: precision input 8 bit, weight 1 bit; model size 1.24 GOPs; performance 207.8 (O) / 318.9 (C) GOP/s; efficiency 44.2 GOP/s/W.
FPGA'17 [41] (FINN): precision input 8 bit, weight 1 bit; models MNIST and Cifar-10; performance 9085.67 (MNIST) and 2465.5 (Cifar-10) GOP/s; efficiency 402.02 and 210.72 GOP/s/W.
This work: precision input 8 bit, weight 1 bit (8 bit for the first and last layer of AlexNet), other parameters 2–8 bit (MNIST & Cifar-10) or 32 bit (AlexNet); models MNIST, Cifar-10 and AlexNet; performance 5905.40 (O) / 12219.40 (P) GOP/s for MNIST, 9396.41 (O) / 17646.50 (P) for Cifar-10, and 1963.96 (O) / 17503.28 (P) for AlexNet; efficiency 225.36 (O) / 466.39 (P), 358.64 (O) / 673.53 (P), and 74.96 (O) / 668.06 (P) GOP/s/W respectively.
If we are prepared to reduce model accuracy for a smaller network, the overall performance should get closer to the peak value (12 TOP/s). For the Cifar-10 case, our ConvNet model achieves a throughput of almost 4 times FINN's. We also support large datasets with our FP-BNN design, which proves the compatibility of our design method with various CNN models.

7.5. Discussion

There is considerable scope for improvement in FP-BNN, especially for the first layers, since datapath utilization is low due to the limited number of input channels. Moreover, the utilization of DSP blocks is low, and more DSP blocks could be involved if they could effectively support low-bandwidth operations to enhance the overall throughput. Furthermore, we can exploit the heterogeneity of logic elements in FPGAs, for instance by introducing different bit-width choices together with binarized data for better use of the DSP multipliers.

This implementation shows that it is promising to implement BNN models especially for embedded systems, which can offer competitive speed and accuracy with low power consumption. Recently, various designs [36,52] have shown that more complicated NN models can also be binarized with a tolerable loss of accuracy. Considering the similarity of the component layers and of the logic generation algorithms, it is feasible to implement these models layer-by-layer in a sequential way as long as there is a sufficient amount of on-chip memory for the parameters.
8. Conclusion

This paper presents FP-BNN, our design for binarized neural networks targeting FPGA technology. Based on the RAMA analysis method, we design a 64-channel accelerator architecture which can accommodate both CONV- and FC-type layers. An XNOR-based method is introduced for binarized vector MAC operations, and the summing-up process is achieved with a popcount compressor tree which can be automatically generated. For small models like the MNIST MLP and the Cifar-10 ConvNet, shift-based normalization is introduced, which largely reduces the cost of multipliers. With proper dynamic quantization of the input and parameters, the model keeps good performance with the weights binarized and the other parameters compressed by over 10 times. Optimized on-chip data storage is managed with parameter quantization. Our implementation on the Maxeler MPC-X2000 platform (with a Stratix-V 5SGSD8 FPGA) shows a promising TOP/s speed with only 26.2 W power at a 150 MHz clock frequency. We expect enhanced accuracy in future binarized models, which should greatly extend their range of applications.

Acknowledgement

The support of the Maxeler University Programme, Altera, Intel, UK EPSRC (EP/P010040/1, EP/L00058X/1, EP/L016796/1 and EP/N031768/1), the European Union Horizon 2020 Research and Innovation Programme under grant agreement number 671653, and the HiPEAC NoE is gratefully acknowledged.

References

[1] Y. LeCun, C. Cortes, C.J. Burges, The MNIST database of handwritten digits, 1998, http://yann.lecun.com/exdb/mnist/.
[2] A. Krizhevsky, V. Nair, G. Hinton, The CIFAR-10 dataset, 2014, https://www.cs.toronto.edu/~kriz/cifar.html.
[3] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[4] G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Process. Mag. 29 (6) (2012) 82–97.
[5] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al., Deep speech 2: end-to-end speech recognition in English and Mandarin, in: International Conference on Machine Learning, 2016, pp. 173–182.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602 (2013).
[7] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (7587) (2016) 484–489.
[8] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, N. Andrew, Deep learning with COTS HPC systems, in: Proceedings of the Thirtieth International Conference on Machine Learning, 2013, pp. 1337–1345.
[12] NVIDIA, Tesla K40 GPU Active Accelerator, NVIDIA, 2013.
[13] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, Y. LeCun, Neuflow: a runtime reconfigurable dataflow processor for vision, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, 2011, pp. 109–116.
[14] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, J. Cong, Optimizing FPGA-based accelerator design for deep convolutional neural networks, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2015, pp. 161–170.
[15] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., Going deeper with embedded FPGA platform for convolutional neural network, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2016, pp. 26–35.
[16] S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv preprint arXiv:1510.00149 (2015).
[17] F.N. Iandola, M.W. Moskewicz, K. Ashraf, S. Han, W.J. Dally, K. Keutzer, Squeezenet: alexnet-level accuracy with 50x fewer parameters and less than 1MB model size, arXiv preprint arXiv:1602.07360 (2016).
[18] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or −1, arXiv preprint arXiv:1602.02830 (2016).
[19] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A.Y. Ng, Reading digits in natural images with unsupervised feature learning, in: Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[20] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, Xnor-net: imagenet classification using binary convolutional neural networks, in: European Conference on Computer Vision, Springer International Publishing, 2016, pp. 525–542.
[21] V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, Efficient processing of deep neural networks: a tutorial and survey, arXiv preprint arXiv:1703.09039 (2017).
[22] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
[23] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, 2015, pp. 448–456.
[24] A. Ng, J. Ngiam, C. Foo, Y. Mai, C. Suen, Backpropagation algorithm, UFLDL tutorial, http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm.
[25] G. Hinton, Neural Network for Machine Learning, Coursera, 2012.
[26] C. Farabet, C. Poulet, J.Y. Han, Y. LeCun, CNP: an FPGA-based processor for convolutional networks, in: Proceedings of the Nineteenth International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2009, pp. 32–37.
[27] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, in: Proceedings of the ACM Sigplan Notices, vol. 49, ACM, 2014, pp. 269–284.
[28] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., Dadiannao: a machine-learning supercomputer, in: Proceedings of the Forty-Seventh Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2014, pp. 609–622.
[29] N.P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., In-datacenter performance analysis of a tensor processing unit, arXiv preprint arXiv:1704.04760 (2017).
[30] W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, in: Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
[31] T.-J. Yang, Y.-H. Chen, V. Sze, Designing energy-efficient convolutional neural networks using energy-aware pruning, arXiv preprint arXiv:1611.05128 (2016).
[32] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, L. Wang, A high performance FPGA-based accelerator for large-scale convolutional neural networks, in: Proceedings of the Twenty-Sixth International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2016, pp. 1–9.
[33] P. Gysel, Ristretto: hardware-oriented approximation of convolutional neural networks, arXiv preprint arXiv:1605.06402 (2016).
[34] F. Li, B. Zhang, B. Liu, Ternary weight networks, arXiv preprint arXiv:1605.04711 (2016).
[35] C. Zhu, S. Han, H. Mao, W.J. Dally, Trained ternary quantization, arXiv preprint arXiv:1612.01064 (2016).
[36] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, Y. Zou, Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv preprint arXiv:1606.06160 (2016).
[37] H. Alemdar, N. Caldwell, V. Leroy, A. Prost-Boucle, F. Pétrot, Ternary neural networks for resource-efficient AI applications, arXiv preprint arXiv:1609.00222 (2016).
[38] W. Meng, Z. Gu, M. Zhang, Z. Wu, Two-bit networks for deep learning on resource-constrained embedded devices, arXiv preprint arXiv:1701.00485 (2017).
[39] R. Andri, L. Cavigelli, D. Rossi, L. Benini, YodaNN: an architecture for ultra-low power binary-weight CNN acceleration, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. PP (2017) 1–14.
[40] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, Z. Zhang, Accelerating binarized convolutional neural networks with software-programmable FPGAs, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2017, pp. 15–24.
[41] Y. Umuroglu, N.J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, K. Vissers, Finn: a framework for fast, scalable binarized neural network inference, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2017, pp. 65–74.
[42] M. Kumm, P. Zipf, Pipelined compressor tree optimization using integer linear programming, in: Proceedings of the Twenty-Fourth International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2014, pp. 1–8.
[43] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited numerical precision, CoRR abs/1502.02551 (2015).
[44] D. Williamson, Dynamically scaled fixed point arithmetic, in: Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, IEEE, 1991, pp. 315–318.
[45] Maxeler, MPC-X series, https://www.maxeler.com/products/mpc-xseries/.
[46] R. Collobert, K. Kavukcuoglu, C. Farabet, Torch7: a Matlab-like environment for machine learning, in: Proceedings of the NIPS Workshop on BigLearn, EPFL-CONF-192376, 2011.
[47] I.J. Goodfellow, D. Warde-Farley, M. Mirza, A.C. Courville, Y. Bengio, Maxout networks, in: ICML (3), 28 (2013), pp. 1319–1327.
[48] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint arXiv:1312.4400 (2013).
[49] D. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[50] M. Courbariaux, Y. Bengio, J.-P. David, Binaryconnect: training deep neural networks with binary weights during propagations, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
[51] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, Y. Cao, Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks, in: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2016, pp. 16–25.
[52] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Quantized neural networks: training neural networks with low precision weights and activations, arXiv preprint arXiv:1609.07061 (2016).

Shuang Liang received the B.S. degree from the Institute of Microelectronics, Tsinghua University, Beijing, China, in 2011. He is working toward the Ph.D. degree at the Institute of Microelectronics, Tsinghua University, Beijing, China. He was a visiting scholar at the Department of Computing, Imperial College London, UK, in 2016. His research interests include reconfigurable computing, hardware acceleration of machine learning algorithms, and distributed systems.

Leibo Liu received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1999 and the Ph.D. degree from the Institute of Microelectronics, Tsinghua University, in 2004. He now serves as an Associate Professor in the Institute of Microelectronics, Tsinghua University. His research interests include reconfigurable computing, mobile computing, and VLSI DSP.

Wayne Luk received the M.A., M.Sc., and D.Phil. degrees in Engineering and Computing Science from the University of Oxford, Oxford, U.K. He is a Professor of Computer Engineering with Imperial College London, London, U.K. He was a Visiting Professor with Stanford University, Stanford, CA, USA. His current research interests include the theory and practice of customizing hardware and software for specific application domains, such as multimedia, networking, and finance.