

A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection

Duy Thanh Nguyen, Tuan Nghia Nguyen, Hyun Kim, Member, IEEE, and Hyuk-Jae Lee, Member, IEEE

This research is supported by "The Project of Industrial Technology Innovation" through the Ministry of Trade, Industry and Energy (MOTIE) (10082585) and by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00721-001, Development of intelligent semiconductor technology for vision recognition signal processing for vehicle based on multi-sensor fusion).
D. T. Nguyen, T. N. Nguyen, and H.-J. Lee are with the Inter-University Semiconductor Research Center, Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea (e-mail: {thanhnd, nghiant, hyuk_jae_lee}@capp.snu.ac.kr).
H. Kim is with the Department of Electrical and Information Engineering and the Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, Seoul 01811, Korea (e-mail: [email protected]). H. Kim is the corresponding author.

Abstract— Convolutional neural networks (CNNs) require numerous computations and external memory accesses. Frequent accesses to off-chip memory cause slow processing and large power dissipation. For real-time object detection with high throughput and power efficiency, this paper presents a Tera-OPS streaming hardware accelerator implementing a YOLO (You-Only-Look-Once) CNN. The parameters of the YOLO CNN are retrained and quantized with the PASCAL VOC dataset using binary weight and flexible low-bit activation. The binary weight enables storing the entire network model in the Block RAMs of a field-programmable gate array (FPGA) to reduce off-chip accesses aggressively and thereby achieve a significant performance enhancement. In the proposed design, all convolutional layers are fully pipelined for enhanced hardware utilization. The input image is delivered to the accelerator line by line. Similarly, the output from the previous layer is transmitted to the next layer line by line. The intermediate data are fully reused across layers, thereby eliminating external memory accesses. The decreased DRAM accesses reduce DRAM power consumption. Furthermore, as the convolutional layers are fully parameterized, it is easy to scale up the network. In this streaming design, each convolution layer is mapped to a dedicated hardware block. Therefore, it outperforms the "one-size-fits-all" designs in both performance and power efficiency. This CNN implemented using a VC707 FPGA achieves a throughput of 1.877 TOPS at 200 MHz with batch processing while consuming 18.29 W of on-chip power, which shows the best power efficiency compared to previous research. As for object detection accuracy, it achieves a mean Average Precision (mAP) of 64.16% on the PASCAL VOC 2007 dataset, which is only 2.63% lower than the mAP of the same YOLO network with full precision.

Index Terms—YOLO, streaming architecture, binary weight, low-precision quantization, object detection

I. INTRODUCTION

Object detection is a challenging task in computer vision. Lately, deep learning has been widely adopted in object detection owing to the support of powerful computation devices such as GPUs. Therefore, several promising deep-learning approaches have been proposed for object detection, such as Single-Shot MultiBox Detection (SSD) [1], Faster R-CNN [2], and YOLO [3]. YOLO offers one of the best trade-offs between accuracy and speed for object detection. It is a single neural network that predicts object bounding boxes and class probabilities in a single evaluation.

Although GPUs are widely used for processing deep-learning algorithms such as YOLO, they are inefficient for optimizations such as selecting the data bit width and scheduling accesses to the external memory. Therefore, extensive research has been conducted to design deep-learning accelerators as Application-Specific Integrated Circuits (ASICs) or on Field-Programmable Gate Arrays (FPGAs) to address this challenge. FPGAs have been widely used for highly efficient deep learning owing to their flexible design and short development cycles. Several implementations use a floating-point representation, which has a large computation cost [4]-[6]. Recent works have demonstrated that a floating-point representation is unnecessarily redundant [7] and that CNNs can be retrained and quantized to a very low-bit precision (1-bit or 2-bit) without significant loss of accuracy [8]-[10]. The quantization enables the design of a fast and power-efficient CNN accelerator using an FPGA that stores the entire quantized CNN model in its on-chip Block RAMs of tens to hundreds of Mb. For example, Virtex UltraScale+ is an FPGA comprising Block RAMs (up to 500 Mb) arranged in small units, thereby providing extremely high memory bandwidth and low power compared to a design using a single big SRAM or an off-chip memory. Thus, an FPGA combined with low-bit CNN quantization enables the design of a low-power accelerator for deep networks offering a throughput on the order of TOPS (tera operations per second).

There are a number of FPGA designs using Vivado HLS (High-Level Synthesis) [4], [6], [11], [12], and [13]. However, these are inefficient in terms of both hardware resources and performance. Zhang et al. [4] present a single processing engine using a theoretical roofline model to design an accelerator for the execution of each layer. However, the accelerator is found to consume a significant portion of the FPGA chip while running at a modest throughput of 61 giga operations per second (GOPS) for a small network of 5 layers. Designs in [6] and [13] propose a fused convolutional layer to reduce the off-chip accesses by optimizing the intermediate data between the neighboring layers in a group. Nevertheless, the authors report a significantly larger number of Block RAMs
(for storing intermediate data) and DSPs (due to additional control logic). Similar to [4], the CNN accelerator in [11] optimizes the data path using loop unrolling and tiling for an enhanced performance of each layer. The authors also use Vivado HLS to design the CNN accelerator for each layer, which is mapped to the processing engine in their accelerator in a pipelined manner. However, the entire intermediate feature-maps generated from each layer are stored in a double buffer, so this scheme does not scale well when the CNN becomes deeper owing to the demand for large buffers. The design in [14] also faces the same problem even though it delivers a high performance for the AlexNet network. Another recent work using Vivado HLS in [12] employs the same optimization as proposed in [4]. In addition, the available resources are partitioned to make multiple convolutional layer processors (CLPs) of smaller size rather than a single large CLP. This work proposes a scheme to decide the number of required CLPs, the resource partitioning among these CLPs, and a scheduling algorithm to utilize their concurrent operations effectively. As the network is not quantized, the intermediate data are generally too large to be stored in on-chip memory. Hence, all CLPs read their inputs from and write their outputs to external memory. Consequently, this design requires a very high memory bandwidth. Facing a similar problem, an approach in [15] presents an RTL compiler to generate an RTL code for each layer of a given network. In this design, each layer also reads inputs from and writes outputs to a DRAM. Each layer operates sequentially, which means that the next layer starts only when the current layer finishes its computation. This non-pipelined processing and the frequent accesses to external DRAM lower the processing speed significantly.

Unlike a conventional convolution, the Winograd minimal filtering algorithm introduced in [16] is employed in [17] and [18] to speed up the convolutional computations. In [17], additional optimizations including loop unrolling and tiling are proposed to increase the throughput up to 1,382 GOPS for AlexNet. With the same filtering algorithm, the design in [18] achieves a throughput of 2.94 TOPS for the VGG network. Nevertheless, this design still demands an excessive number of DSPs and LUTs even though the Winograd algorithm reduces the number of multipliers significantly. Moreover, the design of a single large convolutional layer has an inherent drawback. The authors also report that the performance of the network decreases as it goes deeper owing to the overhead of data transfer back and forth between the CNN accelerator and the external memory.

To reduce expensive external memory accesses, the resolution of the number representation is reduced in [19] and [20]. They aggressively quantize the weight and the activation to a single bit. The MAC operation is replaced with a low-cost pop-count computation, and the comparator in the max-pooling layer is implemented by an OR gate. The authors in [20] report the need for floating-point numbers in batch normalization to avoid a severe degradation of the accuracy. This shows an example in which the performance of a binary network is very poor for a challenging dataset such as ImageNet.

For the implementation of YOLO, several FPGA designs have been proposed. Tincy YOLO, presented in [21], uses an extended version of the design in [19] to offload 12 hidden layers to the programmable logic in a Zynq UltraScale+ FPGA. The hardware accelerator processes these hidden layers one by one. Moreover, the first and last layers are run in software, causing a low frame rate. Lightweight YOLO-v2 is proposed in [22] to combine a binary network with support vector machine (SVM) regression. The authors design a shared streaming binary convolutional circuit in which each layer is processed sequentially. Although these previous designs succeed in speeding up the computation by reducing the complexity of the algorithm, they do not consider the reduction of external memory accesses.

To avoid the frequent off-chip accesses for intermediate data and the large inter-layer double buffers caused by the un-optimized data paths of previous works, this paper proposes an efficient Tera-OPS streaming architecture design. The YOLO-v2 network [3] is used for evaluating the proposed FPGA design in terms of both hardware performance and detection accuracy. The network is retrained and quantized using 1-bit weight and flexible low-bit activation. The main contributions of this paper are summarized as follows:

- A binary weight, flexible low-bit activation, hardware-centric quantization, and a retraining method for the YOLO CNN are presented. This study shows that even the binary weight and 3-to-6-bit activation are adequate to realize the desired accuracy of object detection. The advantages of this quantization are as follows: (i) it requires a minimum number of DSPs, as the convolutional kernel contains only summations; (ii) the binary weight enables storing the entire network model in on-chip memory to minimize the off-chip accesses, thereby enhancing the performance.
- A scalable and high-accuracy streaming architecture for real-time object detection is proposed. The intermediate data are reused to minimize the size of the input buffer of each convolution layer while eliminating the accesses to the off-chip memory. The convolutional layers are fully parameterized; thus, it is easy to change the network structure.
- The proposed architecture is implemented, and its relative merits are highlighted by comparison with previous works. A real-time demo of the object detection is also presented.
- The proposed method can easily be extended to the previous designs as well as to YOLO-v2, and a considerable enhancement in throughput can be expected by solving the off-chip access problem from which the previous designs suffer.

The rest of this paper is organized as follows. Section II introduces the CNN, YOLO-v2, and low-precision network quantization/retraining. Section III presents the optimization of the algorithm for the proposed design. The proposed architecture is elaborated in Section IV. The experimental results are shown in Section V. Finally, Section VI concludes the paper.

II. BACKGROUND

A CNN is typically composed of basic layers: convolution, normalization, pooling, and fully connected. Considering that the focus is on object detection, the fully connected layer is not discussed in this paper.

A. Conventional CNN

1) Convolutional layer
The convolutional layer is used to extract higher-level features from the input image. The convolutional computation is depicted in Fig. 1. The input image, comprising N channels, is convolved with M N-channel filters to produce an M-channel output image. Each kernel has a size of K × K. Algorithm 1 elaborates the convolutional operation in detail. For simplicity, the stride is assumed to be 1, and the bias is assumed to be 0. As a result, the output image has the same size as the input image.

Fig. 1. The convolutional computation: an N-channel H × H input image is convolved (dot product and accumulation) with M N-channel K × K filters to produce M output channels.

The number of operations for a convolutional layer can be calculated as NOPS = 2 × K × K × M × N × H × H. The constant value 2 implies that each MAC needs one multiplication and one addition.

Algorithm 1: Pseudo code for the original convolution layer
    in[N][H][H]: input images (N channels)
    W[M][N][K][K]: weights
    out[M][H][H]: output images (M channels)
    for oc = 0; oc < M; oc++ do
      for r = 0; r < H; r++ do
        for c = 0; c < H; c++ do
          for ic = 0; ic < N; ic++ do
            for i = 0; i < K; i++ do
              for j = 0; j < K; j++ do
                out[oc][r][c] += W[oc][ic][i][j] * in[ic][r+i][c+j];

2) Max-pooling layer
The max-pooling layer is used to reduce the size of the feature-maps, thereby reducing the amount of computation in the network and controlling the overfitting. The original max-pooling computation is explained in Algorithm 2. The size of the output image of the max-pooling layer is half that of the input image.

Algorithm 2: Pseudo code for the original 2×2 max-pooling layer with stride = 2
    in[N][2*H][2*H]: input images (N channels)
    out[N][H][H]: output images (N channels)
    for r = 0; r < H; r++ do
      for c = 0; c < H; c++ do
        for ic = 0; ic < N; ic++ do
          out[ic][r][c] = max(in[ic][2*r][2*c], in[ic][2*r+1][2*c],
                              in[ic][2*r][2*c+1], in[ic][2*r+1][2*c+1]);
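For reference, the two listings above translate directly into compilable C. The sketch below assumes zero padding at the borders (Algorithm 1 leaves boundary handling implicit, while the text states that the output keeps the H × H size) and uses small illustrative constants; none of the sizes are taken from the paper.

    /* Illustrative sizes for this sketch only (not from the paper). */
    #define N 3   /* input channels  */
    #define M 4   /* output channels */
    #define H 8   /* feature-map height and width */
    #define K 3   /* kernel size     */

    /* Direct convolution, stride 1, zero bias, zero padding (Algorithm 1). */
    static void conv_layer(const float in[N][H][H],
                           const float w[M][N][K][K],
                           float out[M][H][H])
    {
        for (int oc = 0; oc < M; oc++)
            for (int r = 0; r < H; r++)
                for (int c = 0; c < H; c++) {
                    float acc = 0.0f;
                    for (int ic = 0; ic < N; ic++)
                        for (int i = 0; i < K; i++)
                            for (int j = 0; j < K; j++) {
                                int y = r + i - K / 2;   /* centre the kernel */
                                int x = c + j - K / 2;
                                if (y >= 0 && y < H && x >= 0 && x < H)
                                    acc += w[oc][ic][i][j] * in[ic][y][x];
                            }
                    out[oc][r][c] = acc;
                }
    }

    /* 2x2 max-pooling with stride 2 (Algorithm 2): halves the spatial size. */
    static void maxpool_2x2(const float in[M][H][H], float out[M][H / 2][H / 2])
    {
        for (int ic = 0; ic < M; ic++)
            for (int r = 0; r < H / 2; r++)
                for (int c = 0; c < H / 2; c++) {
                    float m = in[ic][2 * r][2 * c];
                    if (in[ic][2 * r + 1][2 * c]     > m) m = in[ic][2 * r + 1][2 * c];
                    if (in[ic][2 * r][2 * c + 1]     > m) m = in[ic][2 * r][2 * c + 1];
                    if (in[ic][2 * r + 1][2 * c + 1] > m) m = in[ic][2 * r + 1][2 * c + 1];
                    out[ic][r][c] = m;
                }
    }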
3) Batch normalization
Batch normalization [23] has proven to be efficient in training the CNN. It helps the training converge faster and prevents the network from overfitting. With batch normalization, the output of each convolutional layer is normalized to reduce the internal covariate shift. It is essential for both the training and inference phases.
The original batch normalization is as below:

    y = γ^(i) × (act − μ^(i)) / sqrt([σ^(i)]² + ε) + β^(i)                      (1)

where y and act are the outputs of the batch normalization and of the convolutional computation, respectively; μ^(i) and [σ^(i)]² are the channel-wise mean and variance of the activations, respectively; and γ^(i) and β^(i) are the channel-wise scale and bias, respectively.

B. Quantization of CNN using binary weight and low-bit activation

1) Binary weight
The weights of each kernel are represented by only two values, as shown below:

    w_j^b(i) = { +1  if w_j^(i) ≥ 0
               { −1  if w_j^(i) < 0                                             (2)

where w_j^b(i) is the binary weight and w_j^(i) is the j-th original weight value in the i-th kernel. For a binary-weight network, the convolutional layer is formulated as below:

    act = (x ⊗ w_b^(i)) × μ_W^(i) = x_W^(i) × μ_W^(i)                           (3)

where μ_W^(i) is the mean of the weights of the i-th channel, and x is the input activation that is convolved with the i-th weight kernel.

2) Uniform quantization for activation
Activations are quantized and represented by a fixed number of bits. Given the number of bits n and the quantization step s, the quantized values are computed by (6):

    q_max = s × (2^(n−1) − 0.5)                                                 (4)

    p(x) = s × (round(x / s + 0.5) − 0.5)                                       (5)

    q(x) = {  q_max   if p(x) > q_max
           { −q_max   if p(x) < −q_max
           {  p(x)    otherwise                                                 (6)
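The binary-weight and symmetric activation quantization of (2)-(6) can be summarized in a few lines of C. This is a minimal sketch: the per-kernel scale is computed here as the mean of the absolute weights, which is the usual choice for binary-weight networks and one interpretation of μ_W^(i) above; the function and variable names are illustrative.

    #include <math.h>

    /* Binarize one kernel of 'len' weights (eq. (2)) and return the per-kernel
     * scale mu_W used in eq. (3). The mean of the absolute weight values is
     * used here (assumption; the text only says "mean of the weights").      */
    float binarize_kernel(const float *w, signed char *wb, int len)
    {
        float mu = 0.0f;
        for (int j = 0; j < len; j++) {
            wb[j] = (w[j] >= 0.0f) ? +1 : -1;
            mu += fabsf(w[j]);
        }
        return mu / (float)len;
    }

    /* Symmetric uniform activation quantization, eqs. (4)-(6).
     * n is the number of activation bits and s the quantization step; the
     * result lies on the non-zero grid +/-0.5s, +/-1.5s, ...                 */
    float quantize_activation(float x, int n, float s)
    {
        float q_max = s * ((float)(1 << (n - 1)) - 0.5f);    /* eq. (4) */
        float p     = s * (roundf(x / s + 0.5f) - 0.5f);     /* eq. (5) */
        if (p >  q_max) return  q_max;                       /* eq. (6) */
        if (p < -q_max) return -q_max;
        return p;
    }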

III. ALGORITHMIC OPTIMIZATION FOR THE PROPOSED STREAMING ARCHITECTURE

A. Hardware-centric quantization
This study presents a method to train a low-precision model for the proposed streaming hardware accelerator. Previous studies in [8] and [20] show that the last layer is highly sensitive to low-precision quantization. Therefore, the weights in this layer are quantized to 8-bit fixed point, and its activations are quantized to 16-bit fixed point to minimize the accuracy loss of quantization. In the other layers, including the first layer, the weights and output activations are quantized to 1 bit and 3 to 6 bits, respectively. It is noteworthy that the input image is in RGB format for the first layer.

Optimization for batch normalization: To reduce the number of calculations at the inference phase, (1) is reformulated as below:

    y = x_W^(i) × γ_w^(i) + β_w^(i)                                             (7)

where γ_w^(i) and β_w^(i) are the new scale and bias factors that can be computed beforehand:

    γ_w^(i) = (μ_W^(i) × γ^(i)) / sqrt([σ^(i)]² + ε)                            (8)

    β_w^(i) = −(μ_W^(i) × μ^(i)) / sqrt([σ^(i)]² + ε) + β^(i)                   (9)

As the batch normalization parameters are sensitive to small errors, the new scale and bias factors are quantized to 16-bit fixed-point values to minimize the accuracy loss. By using (7), the hardware for batch normalization requires only one multiplication and one addition, thereby reducing the data-path delay.
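Because the folded scale and bias of (7)-(9) depend only on trained parameters, they can be precomputed offline. A minimal C sketch of that precomputation is shown below; the 16-bit fixed-point format with FRAC_BITS fractional bits is an assumed example, not a format specified by the paper.

    #include <math.h>
    #include <stdint.h>

    #define FRAC_BITS 12   /* fractional bits of the 16-bit format (assumption) */

    /* Fold batch normalization into one scale and one bias per channel,
     * following eqs. (7)-(9), then round to 16-bit fixed point.           */
    void fold_batchnorm(float gamma, float beta, float mean, float var,
                        float mu_w, float eps,
                        int16_t *scale_q, int16_t *bias_q)
    {
        float inv_std = 1.0f / sqrtf(var + eps);
        float scale   =  mu_w * gamma * inv_std;          /* eq. (8) */
        float bias    = -mu_w * mean  * inv_std + beta;   /* eq. (9) */

        *scale_q = (int16_t)lrintf(scale * (1 << FRAC_BITS));
        *bias_q  = (int16_t)lrintf(bias  * (1 << FRAC_BITS));
    }

    /* At inference, eq. (7) then needs only one multiplication and one
     * addition per output: y = x_w * scale + bias.                       */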
Leaky Rectified Linear Unit (Leaky ReLU):

    g(x) = { x     if x > 0
           { a·x   if x ≤ 0                                                     (10)

Compared to the ReLU, the leaky ReLU helps prevent the neurons from dying during training; thus, it is more stable. The leaky coefficient a is chosen empirically as 0.125 for both the training and inference phases so that the floating-point multiplication can be replaced by a 3-bit right-shift operation.

Flexible low-bit activation quantization: Activations are quantized and represented by a fixed number of bits, as shown in equations (4), (5), and (6). The research in [24] shows that each convolutional layer can be quantized using a different number of bits while preserving the accuracy of the quantized network. The layers that have a large number of parameters tend to be more redundant; thus, they can be quantized using a smaller number of bits. Following this finding, the activations of the different layers of the YOLO CNN are flexibly quantized. For Tiny YOLO-v2 [3] (6.97 GOP) and Sim-YOLO-v2 (a simplified version of YOLO-v2 with 24 layers, 18.95 GOP), the number of bits for activation ranges from 3 to 6. The chosen step size is a power-of-two value so that the quantization requires only shift operations instead of multiplications. It should be noted that this quantization is a non-zero scheme (symmetric quantization): there is no zero value in the quantized output (i.e., the quantized values can be ±1, ±3, ...). The zero-centered quantization scheme performs slightly worse than this scheme. Moreover, it has an odd number of quantization levels (i.e., the quantized values can be 0, ±1, ±2, ...), so one quantization level is wasted. The experiments show that the symmetric quantization performs better than the zero-centered quantization. The quantized network with 1-bit weight and flexible low-bit activation reduces the model size by approximately 30×, and the activation size is reduced by 5.4×. Moreover, recent FPGA generations have rich on-chip SRAM resources. For example, the 7-Series VC707 FPGA board has 1030 units of 36-Kb Block RAMs (approximately 4.6 MB), and the Virtex UltraScale+ FPGA chip integrates up to 500 Mb of on-chip memory. Hence, this quantization enables storing the entire model of Sim-YOLO-v2 (or even much deeper networks) in the Block RAMs of the FPGA chip. As a result, the off-chip memory accesses are significantly reduced. Thus, the system performance is boosted, and the power dissipation is reduced. Besides, the quantization also helps reduce the hardware cost by removing the expensive multiplications.
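A quick sketch of how these two choices map to shift-only fixed-point operations is given below. The arithmetic-right-shift treatment of negative values is an assumption of this sketch (it rounds toward negative infinity), and the names are illustrative.

    #include <stdint.h>

    /* Leaky ReLU with a = 0.125 = 2^-3: the multiplication becomes a
     * 3-bit arithmetic right shift of the signed accumulator.          */
    static inline int32_t leaky_relu_shift(int32_t x)
    {
        return (x > 0) ? x : (x >> 3);
    }

    /* With a power-of-two step s = 2^step_shift (step_shift >= 1), the
     * division by s in eq. (5) also reduces to a right shift.          */
    static inline int32_t divide_by_step(int32_t x, int step_shift)
    {
        return x >> step_shift;
    }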

B. Data-path optimization for the streaming computation
Algorithm 1 shows the original loop computation of a convolutional layer. To run it efficiently on dedicated hardware with limited resources, the loop computation needs to be optimized. To solve this problem, loop reordering and tiling are proposed in [6], [11], [13], and [14]. Nevertheless, the entire intermediate feature-maps output by each layer are stored in Block RAMs. Moreover, the designs in [6], [11], and [14] use a double buffer to pipeline the computation. This scheme does not scale well when the CNN becomes deeper because it consumes a large number of Block RAMs. For example, Tiny YOLO-v2 has nine convolutional layers, and the number of feature-maps is 5.8 million. If each feature-map is quantized to 16 bits, it requires 5.8×2×2 = 23.2 MB of Block RAM for the double buffer. To reduce the size of the Block RAM, the studies in [4], [12], [18], [20], and [25] save the intermediate data of each layer in off-chip memory. Hence, the frequent off-chip accesses slow down the computation while consuming more power.
The target of this paper is to eliminate the off-chip accesses for intermediate data while minimizing the on-chip SRAM. To achieve this, a data-path optimization is needed to use the temporary data efficiently. As proposed in [4], this paper also uses block-based computation to achieve a trade-off between hardware resources and performance. However, scheduling the data movement efficiently in the streaming convolutional computation is investigated further. Fig. 2 presents three scheduling schemes covering all the possibilities of weight reuse. The advantages and drawbacks of each strategy are analyzed below.

Fig. 2. The scheduling for the streaming convolutional layer: (a) no weight reuse; (b) full weight reuse; (c) proposed line-based weight reuse with full reuse of the input feature-maps.

In the first strategy, shown in Fig. 2(a), there is no weight reuse. The input sliding cube moves from the beginning toward the end of the channel dimension. Therefore, each sliding cube is convolved with a new weight block. All these values are accumulated to produce a final output. This scheme has the best locality of the partial sum; thus, it does not require a temporary buffer for the accumulation. The input buffer size is K×N×H×QA, where QA is the bit width of the input feature-maps. To overlap the computation between layers, the number of buffer rows is increased from K to (K+1). However, the weight model needs to be read H² times, which is inefficient for a large weight model.
The second scheme, shown in Fig. 2(b), maximizes the weight reuse and is implemented in [11] and [14]. Each weight is reused for the whole input channel (i.e., reused H² times). At a time, Ti input planes are convolved with each of the To weight blocks. The temporary accumulations are stored in an output buffer. The SRAM size of this doubled output buffer is 2×To×H²×QS, where QS is the bit width of the accumulation before quantization. To produce the final To output feature-maps, the entire input feature-maps are accessed. Hence, to generate all M output feature-maps, the input feature-maps are repeatedly read M/To times. Because the entire input feature-maps are read multiple times, the temporary buffers must be large enough to store them. Moreover, to pipeline between layers, the buffer size should be doubled, which gives 2×H²×N×QA.
The scheme proposed by this paper is depicted in Fig. 2(c). The streaming process is explained as follows. The input sliding cube (i.e., K×K×Ti pixels) slides along the width of the input image, which is called a row pass. The input sliding cube is convolved with To weight blocks each time to produce To temporary output values. These weight blocks are reused for a row pass. These To computations are processed in parallel and saved in the line buffers, thereby creating To temporary output channels. The input sliding cube then shifts by Ti channels toward the end of the N input channels. In the next row pass, new To weight blocks are fetched and convolved with the sliding cube. The convolutional outputs are accumulated with the corresponding values from the line buffers and then saved back in the line buffers. This operation repeats Ni = N/Ti times until all the N input channels are computed. The values in the line buffers at this time are the final To output channels, which are then forwarded to the next layer. This computation is complete when all M output channels have been forwarded to the next layer. To finish the processing of one line, the entire weight set of the model is accessed from the memory. For the next row computation, the sliding cube shifts down by one row and then repeats the above process. Therefore, to process the whole input feature-maps, the weights are read H times. As the weights are stored in on-chip SRAM, and weight prefetching can be applied to hide the latency, the weight accesses do not cause system degradation. The computation of a convolutional layer is complete when all lines are processed. It is noteworthy that each row of the input feature-maps is reused K times. Regarding the hardware resources, the input buffer size for pipelining is (K+1)×N×H×QA, and the temporary accumulation buffer size is To×H×QS.
The partial output from a layer is input directly to the next layer without going back to the external memory. In addition, it is noteworthy that the partial-output parameter To of a layer is the partial-input parameter Ti of the next layer, to minimize the cost of the control logic and Block RAM banks. The parallelism parameters Ti and To are chosen to achieve the best trade-off between hardware cost and performance.
Table I summarizes the three scheduling schemes. It should be noted that QA is much smaller than QS after quantization. As analyzed above, the first strategy (i.e., no weight reuse) is not efficient enough owing to the frequent weight reads and the absence of input reuse (i.e., no overlapped sliding windows). The second scheme (i.e., full weight reuse) has the best weight reuse among the three schemes, but it operates with frame-wise computation, which requires a large inter-layer double buffer for pipelining. In batch mode, its buffer and convolution kernel grow linearly as the batch size increases, and the next layer can start only after the entire output of the previous layer has been computed. On the other hand, the proposed scheme requires smaller buffers and causes a smaller delay between layers (i.e., a line delay). Weight prefetching can be used to hide the latency of the weight read. It is also noteworthy that the proposed scheme does not incur any hardware resource overhead in batch mode. Therefore, the proposed line-based weight reuse scheme outperforms the other schemes in terms of both hardware cost and performance.
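To make the proposed schedule of Fig. 2(c) concrete, the following self-contained C loop nest processes one output row with line-based weight reuse: the To weight blocks fetched for a row pass are reused across the whole image width, the partial sums stay in a per-row line buffer, and only the final To channels leave the layer. Quantization, batch normalization, and the pipelining between layers are omitted, and all sizes are illustrative compile-time constants rather than the parameters of the actual design.

    #define KS  3    /* kernel size K      */
    #define TI  4    /* input parallelism  */
    #define TO  4    /* output parallelism */
    #define NCH 8    /* input channels N   */
    #define MCH 8    /* output channels M  */
    #define CW  16   /* image width C      */

    /* One row of outputs with the line-based schedule of Fig. 2(c).
     * in_rows holds the K input rows currently in the line buffer
     * (padded to CW + KS - 1 columns), line_buf holds the To partial-sum
     * channels of the current output row.                               */
    void conv_row_linebased(const float in_rows[NCH][KS][CW + KS - 1],
                            const float w[MCH][NCH][KS][KS],
                            float out_row[MCH][CW])
    {
        for (int mo = 0; mo < MCH / TO; mo++) {           /* To outputs at a time */
            float line_buf[TO][CW] = {{0}};               /* per-row partial sums */
            for (int ni = 0; ni < NCH / TI; ni++) {       /* Ni = N/Ti row passes */
                /* the To weight blocks fetched here are reused for the row pass */
                for (int c = 0; c < CW; c++)              /* slide along the width */
                    for (int to = 0; to < TO; to++)
                        for (int ti = 0; ti < TI; ti++)
                            for (int i = 0; i < KS; i++)
                                for (int j = 0; j < KS; j++)
                                    line_buf[to][c] +=
                                        w[mo * TO + to][ni * TI + ti][i][j] *
                                        in_rows[ni * TI + ti][i][c + j];
            }
            /* after Ni passes the line buffer holds the final To output channels */
            for (int to = 0; to < TO; to++)
                for (int c = 0; c < CW; c++)
                    out_row[mo * TO + to][c] = line_buf[to][c];
        }
    }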

TABLE I
COMPARISON OF THE THREE SCHEDULING SCHEMES FOR STREAMING PROCESSING

Features             | No weight reuse | Full weight reuse | Line-based weight reuse (proposed)
Input buffer size    | (K+1)×N×H×QA    | 2×H²×N×QA         | (K+1)×N×H×QA
Output buffer size   | 0               | 2×To×H²×QS        | To×H×QS
Weight reads (times) | H²              | 1                 | H
Weight reuse (times) | 1               | H²                | H
Relative latency     | 1               | H²                | H

Fig. 3. The overview of the proposed streaming architecture.

IV. THE PROPOSED STREAMING ARCHITECTURE

A. Overview of Accelerator Architecture
Fig. 3 presents the overall block diagram of the proposed design, which is straightforward yet proven to be very efficient. The aggressively quantized model is stored entirely in Block RAMs. The input to each layer is given line by line. Instead of a large double buffer as proposed in [6], [11], and [14], each layer requires only one additional line buffer to overlap the computation between layers. In the streaming design, the timing optimization of each layer is crucial. As each layer processes a different amount of computation at a different speed, there must be synchronization between layers for correct operation. This design uses a handshake mechanism to synchronize the operation of all layers. It is noteworthy that the streaming style completely eliminates the off-chip accesses for intermediate results, which was a limitation of the previous works in [4], [12], [18], [20], and [25]. The DRAM is accessed only for the input image and the final detection results.

B. Streaming Design of the Convolutional Layer
Fig. 4 depicts the proposed architecture of a streaming convolutional layer. It is noteworthy that the kernel size (i.e., 3×3) can be changed easily as the proposed design is fully parameterized. For better understanding, Fig. 4 is explained in conjunction with Fig. 2(c). In Fig. 4(a), the input buffer includes 4 lines of SRAM (for the case of a 3×3 kernel). This additional buffer enables overlapping of the computation of the current layer and the previous layer. A partial input from the previous layer is written to one line buffer while the data from the other three line buffers are sent to To processing engines (PEs) for computation. As described in Section III, the input sliding cube slides along the width of the line buffers to send data to each PE. In each PE, the Ti 3×3 input tiles are convolved with the corresponding 1-bit 3×3 weight kernels, as illustrated in Fig. 4(b). Then, the 3×3 results are summed up using 2-stage ternary adders (i.e., 3-input adders). The results from each of the Ti kernels are input to a pipelined adder tree. The output data from the adder tree are saved temporarily to buffers. In the next iteration, the next Ti input channels are sent to the PEs, the outputs from the adder tree are summed up with the corresponding values in the buffers, and the resulting values are saved back to the buffers. This iteration is complete when all N input channels have been sent to the PEs, at which point the values in the temporary buffer are the final convolutional values. The computation for each line is complete when all M output channels are calculated. This input line is then no longer needed; hence, it can be replaced by the next line. Thus, this feature-map reuse scheme does not require writing back to DRAM, and consequently, the memory bandwidth can be significantly reduced. The outputs of the convolutional computation are input to the batch normalization module, which includes both the quantization and the leaky activation described in Section III. The outputs of the batch normalization are the final quantized values, which are concatenated and forwarded to the next layer. It should be noted that the number of bits used to quantize the activations is less than the number of bits needed to represent the real value of the activations. For example, if the activations of a layer are quantized to 6 bits, 7 bits are needed to represent their real values:

    real_value = (step / 2) × (2 × quantized_value + 1)                         (11)

This representation saves 1/7th of the SRAM size for the input line buffers at the cost of one more addition. The SRAM size for the circular buffer is 4×C×N×QA, where C is the width of the input image. In this streaming design, the partial inputs arrive at a layer in the same order as they are read out for computation. Therefore, it is efficient to have a single deep SRAM to store multiple partial-line inputs sequentially, as depicted in Fig. 5. This storage scheme utilizes the Block RAMs of the FPGA effectively. The depth and width of the SRAM are N/Ti×C and Ti×QA, respectively. Ti is chosen such that the SRAM width is not too large; thus, the number of Block RAMs needed for a line buffer is minimal. Ti and To are parallelism factors chosen to achieve the best trade-off between performance and hardware cost. It is also noteworthy that the number of bits of the input and output activations can be flexibly changed during the design phase to achieve the best performance.
For signed addition, this study uses binary adder trees with pipelined registers added between the stages to achieve a high clock speed. The bit width of each pipeline stage is flexibly chosen to minimize the hardware cost.
The computation of a convolutional layer has two basic tasks: parameter fetching and computation. To increase the speed of computation, the parameters are fetched beforehand. The size of the weight buffer is doubled, and it works as a ping-pong buffer. The period of each pipeline stage is equal to that of the longest task. This optimization increases the throughput of a layer significantly.
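As a concrete reading of (11), the helper below converts an n-bit stored code into its fixed-point activation value when the step is a power of two. The two's-complement code range and the shift-based scaling are assumptions consistent with the symmetric quantization of Section III, not a description of the actual RTL.

    #include <stdint.h>

    /* Stored n-bit code k (two's complement) -> fixed-point activation value.
     * With step = 2^step_shift (step_shift >= 1), eq. (11) gives
     * value = (step/2) * (2k + 1), i.e. odd multiples of step/2.            */
    static inline int32_t code_to_value(int32_t k, int step_shift)
    {
        return (2 * k + 1) << (step_shift - 1);
    }

    /* Example: 6-bit codes k = -32 ... 31 with step = 2 (step_shift = 1) map
     * to the odd values -63 ... +63, which need 7 bits, one more bit than
     * the stored code, matching the 6-bit/7-bit example in the text above.  */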

Fig. 4. The streaming design of a convolutional layer. (a) The architecture of a 3×3 convolutional layer. (b) The design of the 1-bit weight, low-bit activation 3×3 kernel.

Fig. 5. The memory storage for a line buffer composed of multiple partial lines.

The design of the 1×1 convolutional layer is similar to that of the 3×3 layer. The only differences are: (i) the input buffer requires only two lines of memory; and (ii) there is no need for the Ti 3×3 kernels; instead, the input is multiplied with the binary weight and then fed to the adder tree.
For efficient prefetching of the weights, which speeds up the convolutional computation, the weights in the Block RAMs need to be stored in a predefined pattern. Fig. 6 describes the proposed weight memory pattern to support partial convolution. Each weight block contains the weights for To PEs, each of which computes the partial convolution of Ti input channels. Weight blocks are stored sequentially in memory according to the processing order. Weights from memory are loaded into the weight buffer block by block at consecutive addresses. Each weight block is reused for one line of the Ti input channels. To produce one line of final output, the whole weight set of that layer is loaded. Hence, each layer needs to load its weights an image-width number of times to process the entire image. For layers with a large number of kernels, this does not incur many weight reloads from Block RAMs, as the width of the input image is rather small for those layers.

Fig. 6. Weight memory pattern for efficient weight prefetching.

C. Streaming Design of the Max-pooling Layer
The streaming design of the 2×2 partial-input max-pooling layer is depicted in Fig. 7. It is noteworthy that this layer requires one line of buffer with a depth of half the width of the input image. The partial input from the convolutional layer is latched, and the latched input is then compared to the current input. If the current row is even (assuming that rows are counted from zero), the comparison results are stored in the line buffer. Otherwise (i.e., if the current row is odd), the comparison results are compared once more to the values at the corresponding addresses in the buffer. The final comparison results are concatenated to produce the Ti-channel outputs.

Fig. 7. The streaming architecture of a 2×2 partial max-pooling layer.
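A behavioural C sketch of this streaming max-pooling is given below for a single channel; the hardware processes Ti channels in parallel and keeps the comparison results packed, but the control flow is the same. The function and variable names are illustrative.

    #include <stddef.h>

    /* Streaming 2x2, stride-2 max-pooling for one channel.
     * row      : one input row of width C (C assumed even)
     * row_idx  : index of this row, counting from zero
     * line_buf : C/2 entries kept between an even row and the next odd row
     * out_row  : receives C/2 outputs when row_idx is odd
     * Returns 1 when an output row has been produced.                       */
    int maxpool_stream_row(const int *row, int row_idx, size_t C,
                           int *line_buf, int *out_row)
    {
        for (size_t c = 0; c < C / 2; c++) {
            int h = row[2 * c] > row[2 * c + 1] ? row[2 * c] : row[2 * c + 1];
            if (row_idx % 2 == 0)
                line_buf[c] = h;                                /* even row: park   */
            else
                out_row[c] = h > line_buf[c] ? h : line_buf[c]; /* odd row: emit    */
        }
        return row_idx % 2;
    }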

D. Resource-Aware Parallelism
The aforementioned parameters Ti and To are chosen to achieve the best performance of the streaming architecture in terms of both throughput and hardware cost. As shown in Fig. 2, the parallelism factor is Ti × To. Larger values of Ti and To achieve a higher throughput in the convolution layer. However, as the network is very deep, each big layer contributes substantially to the hardware resources. Therefore, Ti and To should be carefully selected.
For a convolutional layer that computes N-channel inputs to produce M-channel outputs, the number of repetitions of the partial computation and accumulation needed to produce a final output is Ni = N/Ti. In addition, To output channels are computed in parallel. Thus, the number of repetitions required to compute all output channels is Mo = M/To. It is noteworthy that Ti and To are divisors of N and M, respectively, to avoid a complicated design and the underutilization of the computation kernels.
The delay of each block-based computation for each kernel size (i.e., 1×1 and 3×3) is given below:

    t_1×1 = 8 + log(Ti)                                                         (12)

    t_3×3 = 10 + log(Ti)                                                        (13)

Here, the specific numbers 8 and 10 appear owing to certain pipeline stages in the convolutional kernel.
Hence, the computation time per line of each layer (depending on the kernel size) is given as follows:

    T_1×1 = (8 + log(Ti)) × C × (N/Ti) × (M/To)                                 (14)

    T_3×3 = (10 + log(Ti)) × C × (N/Ti) × (M/To)                                (15)

where C is the width of the input image. It should be noted that the computation time varies for each group of layers. For example, owing to the 2×2 max-pooling layer, 2 lines are computed in CONV1 to produce 1 line for CONV2. Similarly, CONV2 needs to compute 2 lines to produce 1 line for CONV3, and so on. To achieve the maximum efficiency of pipeline processing, the computation time of each layer should be similar. It is noteworthy that the Ti parameter of the current layer is the To value of the previous layer. Therefore, the parallelism factors of two consecutive convolutional layers (denoted as layers 1 and 2 below) belonging to different groups should satisfy the following condition:

    2 × (10 + log(Ti1)) × C1 × (N1/Ti1) × (M1/To1) ≈ (10 + log(Ti2)) × C2 × (N2/Ti2) × (M2/To2)      (16)

where M1 = 2N1 = N2 = 0.5×M2, C1 = 2C2, and To1 = Ti2. Hence, the condition is simplified as:

    (10 + log(Ti1)) × (1/Ti1) ≈ (10 + log(To1)) × (1/To2)                       (17)

The convolutional layers in the same group process input images of the same size, with N1 = M2 and M1 = N2. Therefore, the condition is simplified as follows:

    (10 + log(Ti1)) × (1/Ti1) ≈ (8 + log(To1)) × (1/To2)                        (18)
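The balance guideline can be checked numerically by evaluating (15) for candidate (Ti, To) pairs. The short C program below does this for layers 2 and 4 of Table IV (layer 2 feeds a 2×2 max-pooling layer, hence the factor 2 from (16)); reading log() as the base-2 logarithm is an assumption about the notation. With the listed factors, both sides evaluate to the same cycle count per output line, which is exactly the balance the guideline asks for.

    #include <math.h>
    #include <stdio.h>

    /* Cycles per output line of a 3x3 layer, eq. (15); log() read as log2. */
    static double t_line_3x3(double C, double N, double M, double Ti, double To)
    {
        return (10.0 + log2(Ti)) * C * (N / Ti) * (M / To);
    }

    int main(void)
    {
        /* Layer 2 of Table IV: 208x208, N = 32, M = 64,  (Ti, To) = (8, 8).
         * Layer 4 of Table IV: 104x104, N = 64, M = 128, (Ti, To) = (8, 8). */
        double t2 = 2.0 * t_line_3x3(208, 32, 64, 8, 8);
        double t4 = t_line_3x3(104, 64, 128, 8, 8);
        printf("2 x T_conv2 = %.0f cycles, T_conv4 = %.0f cycles\n", t2, t4);
        return 0;
    }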
Multiplications are mapped to DSPs in the FPGA. Hence, DSPs are used only for the batch normalization module. Therefore, the number of DSPs in each layer except the last one is only To. As the last layer requires multiplications between weights and input feature-maps, its number of DSPs is Ti×To + To. Given the number of available DSPs, the parallelism parameters of all layers must satisfy the following condition:

    To^(n) × To^(n−1) + Σ_{i=1}^{n} To^(i) ≤ DSPs_available                     (19)

where n is the number of layers, and To^(i) is the To value of the i-th layer. For each PE in Fig. 4, there are 4×Ti ternary adders and Ti+1 binary adders. Therefore, the convolutional kernel of each layer contains To × (5×Ti + 1) adders. It is noteworthy that the ternary adder is implemented efficiently using the LUT6 and the carry chain of the same slice of the Virtex-7 FPGA chip, thereby saving area and achieving a high frequency [26]. Consequently, only two cycles are required to reduce nine inputs to a single output while guaranteeing a high speed of 200 MHz.

E. Batch processing
As shown in Fig. 8, the streaming architecture with pipelining enables the computation to run in batch mode to fully utilize the pipeline processing. Each layer is delayed by 1 to 4 input lines with respect to the previous layer. Because the network is very deep, the delay from the first layer to the last layer becomes large. If a single frame is processed at a time, the first layers are underutilized for a definite period. The idea of batch processing is that, while the layers at the end of the network are still running on the first image, the first layers, which have finished the first image, can start running on the second image to utilize the idle period. The batch processing mode increases the throughput significantly.
To define the processing time and speedup, T, D, L, and n are defined as follows: T is the frame processing time of the last layer, D is the delay from the first layer to the last layer, L is the latency between the last layers of two consecutive frames, and n is the batch size. The processing time of a frame in normal mode is T + D. The processing time for a batch of size n is D + n×T + (n−1)×L. The speedup can be derived as below:

    speedup = (T + D) / (T + L + (D − L)/n)                                     (20)

A larger batch size results in a higher throughput. It is noteworthy that the accelerator with batch-mode support does not require any additional hardware resources owing to the pipelined processing scheme.

V. EXPERIMENTAL RESULTS

A. Low-bit Quantization
The quantized network is retrained using the Darknet [27] deep learning framework. The YOLO CNN is trained using the PASCAL VOC 2007+2012 dataset and tested using PASCAL VOC 2007. Table II shows the accuracy of the quantized Tiny YOLO-v2. The quantized network with 1-bit weight and 6-bit activation (i.e., 1-b W, 6-b A) incurs an accuracy loss of approximately 2.5% compared to the full-precision network.

Fig. 8. Batch processing: each column of layers conv1 to conv17 works on a different image (image 1, image 2, ..., image n) in the pipeline.

TABLE II
ACCURACY (MEAN AVERAGE PRECISION) OF THE QUANTIZED TINY YOLO CNN (W: WEIGHT, A: ACTIVATION)

Quantization              | Accuracy (%) | Weight size (MB) | Activation size (MB) | Multiplication for convolution
Baseline (32-b W, 32-b A) | 53.96        | 60.53            | 22.0                 | Yes
1-b W, 6-b A              | 51.44        | 2.01             | 4.1                  | No
1-b W, 4-b A              | 49.12        | 2.01             | 2.73                 | No
1-b W, 3-b A              | 45.0         | 2.01             | 2.05                 | No

The model size is reduced by 30×, and the activation size is reduced by 5.4×.
To justify the efficiency of the proposed streaming architecture in computing deeper networks, the quantization of a deeper network is performed. This study selects a simplified version of YOLO-v2 as the baseline for the quantization. This network, called Sim-YOLO-v2, inherits the structure of YOLO-v2 except for the last layers for multi-scale detection [3]. As the research in [3] points out, the pass-through layers, which fetch features from an earlier layer to the final outputs, provide only a modest increase in performance (i.e., 1%); therefore, these pass-through layers are removed to save the computation budget. Hence, Sim-YOLO-v2 contains 19 convolution layers and 5 max-pooling layers. Its architecture is the same as that of the Darknet-19 network in [3]. Table III presents the accuracy of the quantized YOLO-v2 and Sim-YOLO-v2 networks compared to the full-precision networks.

TABLE III
ACCURACY (MEAN AVERAGE PRECISION) OF QUANTIZED YOLO-V2 AND SIM-YOLO-V2

Networks             | Quantization      | Accuracy (%) | Weight size (MB) | Complexity (GOP)
(1) YOLO-v2          | Full precision    | 75.88        | 258              | 34.9
(1) YOLO-v2          | 1-b W, 32-b A     | 71.56        | 8.1              | 17.45
(1) YOLO-v2          | 1-b W, 6-b A      | 71.11        | 8.1              | 17.45
(2) Sim-YOLO-v2      | Full precision    | 72.0         | 79.74            | 18.95
(2) Sim-YOLO-v2      | 1-b W, 32-b A     | 66.99        | 2.54             | 9.48
(2) Sim-YOLO-v2      | 1-b W, 6-b A      | 65.76        | 2.54             | 9.48
(3) Sim-YOLO-v2 FPGA | Full precision    | 66.79        | 58.28            | 17.18
(3) Sim-YOLO-v2 FPGA | 1-b W, 32-b A     | 64.95        | 1.88             | 8.59
(3) Sim-YOLO-v2 FPGA | 1-b W, 6-b A      | 65.07        | 1.88             | 8.59
(3) Sim-YOLO-v2 FPGA | 1-b W, 4-to-6-b A | 64.16        | 1.88             | 8.59

The binary versions of the above networks attain an accuracy comparable with that of the full-precision networks while saving a significant amount of computation. These quantized networks are used for the verification of the streaming architecture. It is noteworthy that the deeper networks perform better in terms of quantization. The binary network numbered (3), which has 17 convolution layers and 5 max-pooling layers (2 CONV layers removed from Sim-YOLO-v2 (2)), achieves an accuracy of 64.16%, which is 12.78% higher than that of the quantized Tiny YOLO-v2 in Table II. In addition, its binary weight size is only 1.88 MB, so it can be stored entirely in the Block RAMs of an FPGA. Even though it is three times more complex than Tiny YOLO-v2, a high throughput is achieved by virtue of the streaming architecture. To provide detailed information on the network topology, the architecture of the implemented network is described in Table IV.

TABLE IV
YOLO-V2 NETWORK ARCHITECTURE AND PARALLELISM FACTORS FOR EACH LAYER

Layer | Type | Size / Stride | Filters | Input Size   | Output Size  | Out bit width | PF (Ti, To)
0     | C    | 3×3 / 1       | 32      | 416×416×3    | 416×416×32   | 6             | (3, 32)
1     | M    | 2×2 / 2       | -       | 416×416×32   | 208×208×32   | 6             | (8, 8)
2     | C    | 3×3 / 1       | 64      | 208×208×32   | 208×208×64   | 6             | (8, 8)
3     | M    | 2×2 / 2       | -       | 208×208×64   | 104×104×64   | 6             | N/A
4     | C    | 3×3 / 1       | 128     | 104×104×64   | 104×104×128  | 6             | (8, 8)
5     | C    | 1×1 / 1       | 64      | 104×104×128  | 104×104×64   | 6             | (8, 8)
6     | C    | 3×3 / 1       | 128     | 104×104×64   | 104×104×128  | 6             | (8, 8)
7     | M    | 2×2 / 2       | -       | 104×104×128  | 52×52×128    | 6             | N/A
8     | C    | 3×3 / 1       | 256     | 52×52×128    | 52×52×256    | 6             | (8, 8)
9     | C    | 1×1 / 1       | 128     | 52×52×256    | 52×52×128    | 6             | (8, 8)
10    | C    | 3×3 / 1       | 256     | 52×52×128    | 52×52×256    | 6             | (8, 16)
11    | M    | 2×2 / 2       | -       | 52×52×256    | 26×26×256    | 6             | N/A
12    | C    | 3×3 / 1       | 512     | 26×26×256    | 26×26×512    | 6             | (16, 8)
13    | C    | 1×1 / 1       | 256     | 26×26×512    | 26×26×256    | 6             | (8, 16)
14    | C    | 3×3 / 1       | 512     | 26×26×256    | 26×26×512    | 6             | (16, 8)
15    | C    | 1×1 / 1       | 256     | 26×26×512    | 26×26×256    | 6             | (8, 16)
16    | C    | 3×3 / 1       | 512     | 26×26×256    | 26×26×512    | 4             | (16, 8)
17    | M    | 2×2 / 2       | -       | 26×26×512    | 13×13×512    | 4             | N/A
18    | C    | 3×3 / 1       | 1024    | 13×13×512    | 13×13×1024   | 6             | (8, 16)
19    | C    | 1×1 / 1       | 512     | 13×13×1024   | 13×13×512    | 4             | (16, 8)
20    | C    | 3×3 / 1       | 1024    | 13×13×512    | 13×13×1024   | 6             | (8, 16)
21    | C    | 1×1 / 1       | 125     | 13×13×1024   | 13×13×125    | 16            | (16, 5)
Note: C = convolution, M = max-pooling, PF = parallelism factors (Ti, To).

B. Implementation Result
The analysis in Section IV-D defines the guidelines to achieve balanced pipelining. The parallelism factors of each layer are empirically chosen according to these guidelines. These factors satisfy the FPGA resource constraints and the target operating frequency (i.e., 200 MHz). The detailed parallelism factors of each layer are listed in Table IV.
The framework in Fig. 3 is used to verify the operation of the proposed design. Input images and control commands from the host PC are sent to the accelerator through a PCI Express port. After the computation, the detection results are sent back to the host PC for post-processing. Fig. 9 demonstrates the real-time object detection of the YOLO network using the proposed design on the VC707 evaluation board. The detector can detect the 20 object classes of the PASCAL VOC dataset at 30 fps.

Fig. 9. A live demo of the proposed work on the VC707 FPGA board.

The batch processing increases the throughput significantly as it improves the utilization of each convolutional layer. The experiments at 200 MHz show that T = 7.5 ms, D = 8.975 ms, and L = 1.43 ms. According to equation (20), the dependence of the speedup on the batch size is illustrated in Fig. 10. The speedup almost saturates at 1.8 with a batch size of 30.

Fig. 10. Throughput improvement with batch processing.
TABLE V
IMPLEMENTATION RESULTS OF THE PROPOSED DESIGN WITH AND WITHOUT BATCH PROCESSING

Features             | Without batch                  | With batch
Device               | Virtex-7 VC707 FPGA            |
Operating frequency  | 200 MHz                        |
Block RAMs (18 Kb)   | 1144 (55.5%)                   |
DSPs                 | 272 (9.7%)                     |
LUTs – FFs           | 155.2K (51.1%) – 115K (18.9%)  |
mAP                  | 64.16%                         |
DRAM bandwidth       | 47.2 MB/s                      | 84.96 MB/s
Frame rate (416×416) | 60.72 fps                      | 109.3 fps
Throughput           | 1043 GOPS                      | 1877 GOPS
Power                | 11.11 W                        | 18.29 W

Table V shows the performance of the proposed hardware design with (batch size = 30) and without batch processing. Owing to the intermediate data reuse, the DRAM bandwidth is below 100 MB/s in batch processing mode. A low-cost single-data-rate SDRAM is therefore sufficient for high performance. The utilization of DSPs is below 10%; most DSPs are consumed by the last layer, which requires multiplications between weights and input activations. With batch processing, the throughput reaches 1.877 TOPS with a power consumption of 18.29 W.
Table VI shows the comparison of the proposed design with the previous works on YOLO hardware implementations. The proposed design running Tiny YOLO-v2 significantly outperforms the Tincy YOLO presented in [21] in terms of throughput, accuracy, and power efficiency. Compared to the performance of Sim-YOLO-v2 running on a GTX Titan X GPU in [3], the proposed design with batch processing exhibits a higher throughput (1.24 times) and a much better power efficiency (11.54 times). The design in [22] combines a binary CNN (for feature extraction) with a parallel SVM (for refined detection) to achieve an accuracy 3.46% higher than that of the proposed design. However, the number of classes used for the accuracy calculation in [22] is not clear. The proposed design achieves 3.1 times higher throughput (with a higher resolution) at a 1.5 times lower frequency, and 2.68 times better efficiency (GOPS per LUT). It is notable that a deeper network makes the proposed architecture even more efficient.
The comparison between the proposed design and other recent works on CNN hardware implementations is presented in Table VII. For a fair comparison, both the small-scale networks in [19] and [30] and the large-scale deep networks in [18], [20], and [29] are selected. In the case of low-precision designs for small-scale networks, the small input size (32×32) and the simple network structure result in a high throughput and efficient resource utilization. Therefore, most of the previous FPGA designs for binary neural networks (BNNs) are validated using only tiny datasets such as CIFAR-10 or MNIST [19], [20], [30], [31]. To compare legitimately with the previous works, the convolutional layers of VGG-16 are quantized in this study using 1-bit weight and 2-bit activation, because this study is not optimized for a fully binarized network. The low-precision model of VGG-16 runs on the proposed FPGA design to estimate the accuracy and performance. Consequently, as shown in Table VII, the proposed design achieves a throughput of 4,420 GOPS with a batch size of 32. In the case of [19], despite the small size of the CNN (i.e., 0.1125 GOP) and the validation with small images, the power efficiency is much lower than that of the proposed design. In the case of [30], the proposed design outperforms [30] in terms of efficiency (i.e., throughput per LUT), although [30] achieves the highest throughput owing to its BNN structure. However, it should be noted that the design in [30] operates on an UltraScale-family FPGA (20 nm), which is more than two times larger than the FPGA used in this work, and the BNN structure in [30] cannot support large input sizes owing to its simple structure. This shows that the proposed work outperforms the other short bit-width CNN implementations.
The next comparison is for large-scale deep networks. The authors in [20] run AlexNet on their own design. The average throughput is reported as 1,964 GOPS, which is 2.25 times lower than the throughput of this work. Owing to the frequent off-chip accesses, the first layer (i.e., 210 GOPS) is the bottleneck of their design. The research in [18] does not count the first layer of the first group in the average throughput owing to a similar problem. For this layer (183 MOP), achieving the stated throughput of 2,735 GOPS would correspond to a frame rate of 14,754 fps. Assuming that the output feature-maps use 8-bit data, the average memory bandwidth would be 47 GB/s, which is much higher than the maximum bandwidth (19.2 GB/s) that
the ZCU102 FPGA board can provide [32]. In contrast, the proposed design requires a very small memory bandwidth (i.e., 85 MB/s) owing to the streaming design and the data-path optimization. Thus, it achieves a much higher performance in terms of both throughput and power efficiency. In conclusion, considering the trade-off between hardware resources, CNN size, throughput, and power, the proposed design outperforms the previous works irrespective of the precision.

TABLE VI
COMPARISON OF THE PROPOSED DESIGN WITH THE PREVIOUS WORKS FOR YOLO CNN HARDWARE

                            | Sim-YOLO-v2 on GPU [3] | Tincy YOLO [21]          | Lightweight YOLO-v2 [22] | This work (Tiny YOLO-v2) | This work (Sim-YOLO-v2)
Platform                    | GTX Titan X (16 nm)    | Zynq Ultrascale+ (16 nm) | Zynq Ultrascale+ (16 nm) | Virtex-7 VC707 (28 nm)   | Virtex-7 VC707 (28 nm)
Frequency                   | 1 GHz                  | N/A                      | 300 MHz                  | 200 MHz                  | 200 MHz
BRAMs (18 Kb)               | N/A                    | N/A                      | 1706                     | 1026                     | 1144
DSPs                        | N/A                    | N/A                      | 377                      | 168                      | 272
LUTs - FFs                  | N/A                    | N/A                      | 135K - 370K              | 86K - 60K                | 155K - 115K
CNN size (GOP)              | 17.18                  | 4.5                      | 14.97                    | 6.97                     | 17.18
Precision (W, A)**          | (32, 32)               | (1, 3)                   | (1-32, 1-32)             | (1, 6)                   | (1, 6)
Image size                  | 416×416                | 416×416                  | 224×224                  | 416×416                  | 416×416
Frame rate                  | 88                     | 16                       | 40.81                    | 66.56                    | 109.3
Accuracy (mAP) (%)          | 66.79                  | 48.5                     | 67.6                     | 51.38                    | 64.16
Throughput (GOPS)           | 1512                   | 72                       | 610.9                    | 464.7                    | 1877
Efficiency (GOPS/kLUT)      | N/A                    | N/A                      | 4.52                     | 5.40                     | 12.11
Power (W)                   | 170                    | 6                        | N/A                      | 8.7                      | 18.29
Power efficiency (GOP/s/W)  | 8.89                   | 12                       | N/A                      | 53.29                    | 102.62
Note: ** W: weight, A: activation.

TABLE VII
COMPARISON OF THE PROPOSED DESIGN WITH THE PREVIOUS WORKS FOR OTHER CNN HARDWARE

                            | FPGA'17 [19]  | HiPEAC'17 [30]            | Neurocomputing [20] | FCCM'17 [18]     | TVLSI'18 [29]          | This work
Platform                    | Zynq XC7Z045  | Kintex Ultrascale XCKU115 | Stratix-V 5SGSD8    | Zynq Ultrascale+ | Intel Arria 10 GX 1150 | Virtex-7 VC707
Frequency (MHz)             | 200           | 125                       | 150                 | 200              | 200                    | 200
BRAMs (18 Kb)               | 186           | 1814                      | 2210 (*)            | 1824             | 2232 (*)               | 1214
DSPs                        | N/A           | N/A                       | 384                 | 2520             | 1518                   | 272
LUTs                        | 46.3K         | 392.9K                    | 230.9K (*)          | 600K             | 138K (*)               | 104.7K
FFs                         | N/A           | 348K                      | N/A                 | N/A              | N/A                    | 140.1K
CNN size (GOP)              | 0.1125        | 1.2                       | 1.45                | 5 layers of VGG  | 30.95                  | 30.74
Precision (W, A)            | (1, 1)        | (1, 1)                    | (1, 1)              | (16, 16)         | (16, 16)               | (1, 2)
Image size                  | 32×32         | 32×32                     | 224×224             | 224×224          | 224×224                | 224×224
Throughput (GOPS)           | 2463.8        | 14814                     | 1964.0              | 2940.7           | 715.9                  | 4420
Efficiency (GOPS/kLUT)      | 53.2          | 37.7                      | 8.51                | 4.90             | 5.19                   | 42.22
Power (W)                   | 11.7          | N/A                       | 26.2                | 23.6             | N/A                    | 14.72
Power efficiency (GOP/s/W)  | 210.58        | N/A                       | 74.96               | 124.6            | N/A                    | 300.27
Note: (*) for the Intel FPGAs: Block RAM (20 Kb) and logic cells (ALMs).

VI. CONCLUSION
This paper presents a high-performance, hardware-efficient streaming architecture for real-time object detection, obtained by quantizing the network and optimizing the data path to eliminate the off-chip accesses for intermediate data. With batch processing of the streaming computation, the proposed design achieves a throughput of 1.877 TOPS without increasing the hardware cost, which outperforms most previous designs. It is worth mentioning that the deeper the network is, the more efficient the architecture becomes. Therefore, the proposed design is expected to contribute significantly to real-time object detection.

REFERENCES
[1] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," [Online]. Available: arxiv.org/abs/1512.02325.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," [Online]. Available: arxiv.org/abs/1506.01497.
[3] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," [Online]. Available: arxiv.org/abs/1612.08242.
[4] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), 2015, pp. 161–170.
[5] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in Proc. IEEE/ACM Int. Symp. Microarchitecture, 2016.
[6] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in Proc. IEEE/ACM Int. Symp. Microarchitecture, 2016.
[7] D. T. Nguyen, H. Kim, H.-J. Lee, and I.-J. Chang, "An approximate memory architecture for a reduction of refresh power consumption in deep learning applications," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2018, pp. 1-5.
[8] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," [Online]. Available: arxiv.org/abs/1603.05279.
[9] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights," [Online]. Available: arxiv.org/abs/1702.03044.
[10] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," [Online]. Available: arxiv.org/abs/1602.02830.
[11] F. Sun, C. Wang, L. Cong, C. Xu, Y. Zhang, Y. Lu, X. Li, and X. Zhou, "A high performance accelerator for large-scale convolutional neural networks," in Proc. Int. Conf. Parallel Distributed Process. Appl., 2017.
[12] Y. Shen, M. Ferdman, and P. Milder, "Maximizing CNN accelerator efficiency through resource partitioning," in Proc. Int. Symp. Comput. Archit. (ISCA), 2017, pp. 535–547.
[13] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs," in Proc. Design Autom. Conf. (DAC), 2017.
[14] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. Int. Conf. Field-Program. Logic Appl., 2016.
[15] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks," in Proc. Int. Conf. Field-Program. Logic Appl., 2017.
[16] S. Winograd, Arithmetic Complexity of Computations, vol. 33, SIAM, 1980.
[17] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL Deep Learning Accelerator on Arria 10," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), 2017, pp. 55–64.
[18] L. Lu, Y. Liang, Q. Xiao, and S. Yan, "Evaluating fast algorithms for convolutional neural networks on FPGAs," in Proc. IEEE Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), 2017.
[19] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), 2017, pp. 65–74.
[20] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, "FP-BNN: Binarized neural network on FPGA," Neurocomputing, vol. 275, pp. 1072–1086, 2018.
[21] T. B. Preußer, G. Gambardella, N. Fraser, and M. Blott, "Inference of Quantized Neural Networks on Heterogeneous All-Programmable Devices," in Proc. Design, Autom. Test Eur. Conf. (DATE), 2018.
[22] H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, "A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), 2018, pp. 31–40.
[23] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 448–456.
[24] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, "Fixed Point Quantization of Deep Convolutional Networks," in Proc. Int. Conf. Mach. Learn. (ICML), 2016, pp. 2849–2858.
[25] Y.-H. Chen et al., "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[26] 7 Series DSP48E1 Slice User Guide, Xilinx Inc., 2018.
[27] Darknet Deep Learning Framework. [Online]. Available: https://github.com/pjreddie/darknet.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," [Online]. Available: arxiv.org/abs/1409.1556.
[29] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354–1367, 2018.
[30] N. J. Fraser, Y. Umuroglu, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "Scaling Binarized Neural Networks on Reconfigurable Logic," [Online]. Available: arxiv.org/abs/1701.03400.
[31] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren, "A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks," ACM J. Emerg. Technol. Comput. Syst., vol. 14, no. 2, Article 18, 2018.
[32] Zynq UltraScale+ MPSoC Data Sheet, Xilinx Inc., 2018.

Duy Thanh Nguyen received the B.S. degree in electrical engineering from Hanoi University of Science and Technology, Hanoi, Vietnam, and the M.S. degree in Electrical and Computer Engineering from Seoul National University, Seoul, Korea, in 2011 and 2014, respectively. He is currently working toward the Ph.D. degree in Electrical and Computer Engineering at Seoul National University. His research interests include computer architecture, memory systems, and SoC design for computer vision applications.

Tuan Nghia Nguyen received the B.S. degree in Electronics and Telecommunications from Hanoi University of Science and Technology, Hanoi, Vietnam. He is currently working toward the M.S. degree in Electrical and Computer Engineering from Seoul National University, Seoul, Korea. His research interests are computer vision, deep learning applications, and computer architecture.

Hyun Kim received the B.S., M.S., and Ph.D. degrees in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011, and 2015, respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development for IT, Seoul National University, Seoul, Korea, as a BK Assistant Professor. In 2018, he joined the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul, Korea, where he is currently working as an Assistant Professor. His research interests are in the areas of algorithms, computer architecture, memory, and SoC design for low-complexity multimedia applications and deep neural networks.

Hyuk-Jae Lee received the B.S. and M.S. degrees in electronics engineering from Seoul National University, South Korea, in 1987 and 1989, respectively, and the Ph.D. degree in Electrical and Computer Engineering from Purdue University, West Lafayette, IN, in 1996. From 1996 to 1998, he was with the faculty of the Department of Computer Science, Louisiana Tech University, Ruston, LA. From 1998 to 2001, he was with the Server and Workstation Chipset Division, Intel Corporation, Hillsboro, OR, as a Senior Component Design Engineer. In 2001, he joined the School of Electrical Engineering and Computer Science, Seoul National University, South Korea, where he is currently a Professor. He is a founder of Mamurian Design, Inc., a fabless SoC design house for multimedia applications. His research interests are in the areas of computer architecture and SoC design for multimedia applications.