FPGA Design for Object Detection
… (for storing intermediate data) and DSPs (due to additional control logic). Similar to [4], the CNN accelerator in [11] optimizes the data path using loop unrolling and tiling for enhanced performance of each layer. The authors also use Vivado HLS to design the CNN accelerator for each layer, which is mapped to the processing engine in their accelerator in a pipelined manner. However, the entire intermediate feature-maps generated from each layer are stored in a double buffer, so this scheme does not scale well when the CNN becomes deeper owing to the demand for large buffers. The design in [14] also faces the same problem even though it delivers a high performance for the AlexNet network. Another recent work using Vivado HLS in [12] employs the same optimization as proposed in [4]. In addition, the available resources are partitioned to make multiple convolutional layer processors (CLPs) of smaller size rather than a single large CLP. This work proposes a scheme to decide the number of required CLPs, the resource partitioning among these CLPs, and a scheduling algorithm to utilize their concurrent operations effectively. As the network is not quantized, the intermediate data are generally too large to be stored in on-chip memory. Hence, all CLPs read their inputs from and write their outputs to external memory. Consequently, this design requires a very high memory bandwidth. Facing a similar problem, an approach in [15] presents an RTL compiler to generate an RTL code for each layer of a given network. In this design, each layer also reads inputs from and writes outputs to a DRAM. Each layer operates sequentially, which means that the next layer starts only when the current layer finishes its computation. This non-pipelined processing and the frequent accesses to the external DRAM lower the processing speed significantly.

Unlike a conventional convolution, the Winograd minimal filtering algorithm introduced in [16] is employed in [17] and [18] to speed up the convolutional computations. In [17], additional optimizations including loop unrolling and tiling are proposed to increase the throughput up to 1,382 GOPS for AlexNet. With the same filtering algorithm, the design in [18] achieves a throughput of 2.94 TOPS for the VGG network. Nevertheless, this design still demands an excessive number of DSPs and LUTs even though the Winograd algorithm reduces the number of multipliers significantly. Moreover, the design of a single large convolutional layer has an inherent drawback. The authors also report that the performance of the network decreases as it goes deeper owing to the overhead of data transfer back and forth between the CNN accelerator and the external memory.

To reduce expensive external memory accesses, the resolution of the number representation is reduced in [19] and [20]. They aggressively quantize the weights and the activations to a single bit. The MAC operation is replaced with a low-cost pop-count computation, and the comparator in the max-pooling layer is implemented by an OR gate. The authors in [20] report the need for floating-point numbers in batch normalization to avoid a severe degradation of the accuracy. This is an example showing that the performance of a binary network can be very poor on a challenging dataset such as ImageNet.

For the implementation of YOLO, several FPGA designs have been proposed. Tincy YOLO, presented in [21], uses an extended version of the design in [19] to offload 12 hidden layers to the programmable logic in a Zynq UltraScale+ FPGA. The hardware accelerator processes these hidden layers one by one. Moreover, the first and last layers are run in software, causing a low frame rate. Lightweight YOLO-v2 is proposed in [22] to combine a binary network with support vector machine (SVM) regression. The authors design a shared streaming binary convolutional circuit in which each layer is processed sequentially. Although these previous designs succeed in speeding up the computation by reducing the complexity of the algorithm, they do not consider the reduction of external memory accesses.

To avoid the frequent off-chip accesses for intermediate data or the large inter-layer double buffers caused by the un-optimized data paths of previous works, this paper proposes an efficient Tera-OPS streaming architecture design. The YOLO-v2 network [3] is used for evaluating the proposed FPGA design in terms of both hardware performance and detection accuracy. The network is retrained and quantized using 1-bit weights and flexible low-bit activations. The main contributions of this paper are summarized as follows:

- A binary weight, flexible low-bit activation, hardware-centric quantization, and a retraining method for the YOLO CNN are presented. This study shows that even the binary weight and 3-6-bit activation are adequate to realize the desired accuracy of object detection. The advantages of this quantization are as follows: (i) it requires a minimum number of DSPs, as the convolutional kernel contains only summations; (ii) the binary weight enables storing the entire network model in on-chip memory to minimize the off-chip accesses, thereby enhancing the performance.
- A scalable and high-accuracy streaming architecture for real-time object detection is proposed. The intermediate data are reused to minimize the size of the input buffer of each convolution layer while eliminating the accesses to the off-chip memory. The convolutional layers are fully parameterized; thus, it is easy to change the network structure.
- The proposed architecture is implemented, and its relative merits are highlighted by comparing with previous works. A real-time demo of the object detection is also presented.
- The proposed method can be easily extended to the previous designs as well as to YOLO-v2. A considerable enhancement in throughput can also be expected, because it solves the off-chip access problem from which the previous designs suffer.

The rest of this paper is organized as follows. Section II introduces the CNN, YOLO-v2, and low-precision network quantization/retraining. Section III presents the optimization of the algorithm for the proposed design. The proposed architecture is elaborated in Section IV. The experimental results are shown in Section V. Finally, Section VI concludes the paper.

II. BACKGROUND

A CNN is typically composed of basic layers: convolution, normalization, pooling, and fully connected. Considering that the focus is on object detection, the fully connected layer is not discussed in this paper.
III. ALGORITHMIC OPTIMIZATION FOR THE PROPOSED STREAMING ARCHITECTURE

A. Hardware-centric Quantization

This study presents a method to train a low-precision model for the proposed streaming hardware accelerator. Previous studies in [8] and [20] show that the last layer is highly sensitive to low-precision quantization. Therefore, the weights in this layer are quantized to 8-bit fixed point, and the activations are quantized to 16-bit fixed point to minimize the loss of accuracy caused by quantization. In the other layers, including the first layer, the weights and output activations are quantized to 1 bit and 3-to-6 bits, respectively. It is noteworthy that the input image is in RGB format for the first layer.

Optimization for batch normalization: To reduce the number of calculations at the inference phase, (1) is reformulated as below:

y = x_W^(i) × γ_w^(i) + β_w^(i)    (7)

where γ_w^(i) and β_w^(i) are the new scale and bias factors that can be computed beforehand:

γ_w^(i) = (μ_W^(i) × γ^(i)) / √([σ^(i)]² + ε)    (8)

β_w^(i) = −(μ_W^(i) × μ^(i)) / √([σ^(i)]² + ε) + β^(i)    (9)

As the batch normalization parameters are sensitive to small errors, the new scale and bias factors are quantized to 16-bit fixed-point values to minimize the accuracy loss. By using (7), the hardware for batch normalization requires only one multiplication and one addition, thereby reducing the data-path delay.

Leaky Rectified Linear Unit (Leaky ReLU):

g(x) = x if x > 0, and g(x) = a·x if x ≤ 0    (10)

As compared to ReLU, the leaky ReLU helps prevent the neurons from dying during training; thus, it is more stable. The leaky coefficient a is chosen empirically as 0.125 for both the training and inference phases to replace the floating-point multiplication by a 3-bit right-shift operation.

Flexible low-bit activation quantization: Activations are quantized and represented by a fixed number of bits as shown in equations (4), (5), and (6). The research in [24] shows that each convolutional layer can be quantized using a different number of bits while preserving the accuracy of the quantized network. The layers that have a large number of parameters tend to be more redundant; thus, they can be quantized using fewer bits. Following this finding, the activations from different layers of the YOLO CNN are flexibly quantized. For Tiny YOLO-v2 [3] (6.97 GOP) and Sim-YOLO-v2 (a simplified version of YOLO-v2 with 24 layers, 18.95 GOP), the number of bits for the activation ranges from 3 to 6. The chosen step size is a power-of-two value so that the quantization requires only shift operations instead of multiplications. It should be noted that this quantization is a non-zero scheme (symmetric quantization): there is no zero value in the quantized output (i.e., the quantized value can be ±1, ±3, …). The zero-centered quantization scheme performs a bit worse than the former scheme. Moreover, it has an odd number of quantization levels (i.e., the quantized value can be 0, ±1, ±2, …); thus, one quantization level is wasted. The experiments show that the symmetric quantization performs better than the zero-centered quantization. The quantized network with 1-bit weights and flexible low-bit activations reduces the model size by approximately 30×, and the activation size is reduced by 5.4×. Moreover, the recent FPGA generations have a rich on-chip SRAM resource. For example, the 7 Series VC707 FPGA board has 1030 units of 36-Kb Block RAMs (approximately 4.6 MB), and the Virtex UltraScale+ FPGA chips integrate on-chip memory of up to 500 Mb. Hence, this quantization enables storing the entire model of Sim-YOLO-v2 (or even much deeper networks) in the Block RAMs of the FPGA chip. As a result, the off-chip memory accesses are significantly reduced. Thus, the system performance is boosted, and the power dissipation is reduced. Besides, the quantization also helps reduce the hardware cost (by removing expensive multiplications).

B. Data-path Optimization for the Streaming Computation

Algorithm 1 explains the original loop computation for a convolutional layer. To run it efficiently on dedicated hardware with limited resources, the loop computation needs to be optimized. To solve this problem, loop reordering and tiling are proposed in [6], [11], [13], and [14]. Nevertheless, the entire intermediate feature-maps output from each layer are stored in Block RAMs. Moreover, the designs in [6], [11], and [14] use a double buffer to pipeline the computation. This scheme does not scale well when the CNN becomes deeper because it consumes a large number of Block RAMs. For example, Tiny YOLO-v2 has nine convolutional layers, and the number of feature-map values is 5.8 million. If each feature-map value is quantized to 16 bits, it requires 5.8 × 2 × 2 = 23.2 MB of Block RAM for the double buffer. To reduce the size of the Block RAM, the studies in [4], [12], [18], [20], and [25] save the intermediate data of each layer in off-chip memory. Hence, the frequent off-chip accesses slow down the computation, thereby consuming more power.

The target of this paper is to eliminate the off-chip accesses for intermediate data while minimizing the on-chip SRAM. To achieve this, there should be a data-path optimization to efficiently use the temporary data. As proposed by [4], this paper also uses block-based computations to achieve the trade-off between hardware resource and performance. However, scheduling the data movement efficiently in the streaming convolutional computation is further investigated. Fig. 2 presents three scheduling schemes covering all the possibilities of weight reuse. The advantages and drawbacks of each strategy are analyzed below.
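To make the folding in (7)-(9) and the shift-friendly activation quantization concrete, the following minimal NumPy sketch reproduces the idea. It is an illustration rather than the authors' training code: the function names are ours, the batch-norm folding follows (8)-(9) as reconstructed above, and the odd-level mapping is one plausible reading of the symmetric (non-zero) scheme described in the text, since the paper's equations (4)-(6) are not reproduced in this excerpt.

```python
import numpy as np

def fold_batch_norm(gamma, beta, mu, sigma, mu_w, eps=1e-5):
    """Fold batch-norm into one per-channel scale/bias pair, cf. (7)-(9) above,
    so inference needs only one multiplication and one addition per output."""
    scale = (mu_w * gamma) / np.sqrt(sigma ** 2 + eps)       # gamma_w, cf. (8)
    bias = -(mu_w * mu) / np.sqrt(sigma ** 2 + eps) + beta   # beta_w,  cf. (9)
    return scale, bias

def leaky_relu_shift(x):
    """Leaky ReLU with a = 0.125 = 2^-3, i.e., a 3-bit right shift on negatives."""
    return np.where(x > 0, x, 0.125 * x)

def symmetric_quantize(x, n_bits, step):
    """Symmetric, non-zero quantization to odd levels (+-1, +-3, ...) with a
    power-of-two step, so the scaling is a shift in hardware (illustrative)."""
    q = 2 * np.floor(x / step) + 1                        # odd code per bin
    q = np.clip(q, -(2 ** n_bits - 1), 2 ** n_bits - 1)   # saturate the range
    return q * (step / 2)                                 # bin-center values
```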
...

Fig. 2. The scheduling for the streaming convolutional layer: (a) No weight reuse. (b) Full weight reuse. (c) Proposed line-based weight reuse with full input feature-map reuse.

In the first strategy in Fig. 2(a), there is no weight reuse. The input sliding cube moves from the beginning toward the end of the channel dimension. Therefore, each sliding cube is convolved with a new weight block. All these values are accumulated to produce a final output. This scheme has the best locality of the partial sum; thus, it does not require a temporary buffer for the accumulation. The input buffer size is K×N×H×QA, where QA is the bit width of the input feature-maps. To overlap the computation between layers, the number of buffer rows is increased from K to (K+1). However, the weight model needs to be read H² times, which is inefficient for a large weight model.

The second scheme, shown in Fig. 2(b), maximizes the weight reuse and is implemented in [11] and [14]. Each weight is reused for the whole input channel (i.e., reused H² times). At a time, Ti input planes are convolved with each of the To weight blocks. The temporary accumulations are stored in an output buffer. The SRAM size of this doubled output buffer is 2×To×H²×QS, where QS is the bit width of the accumulation before quantization. To produce the final To output feature-maps, the entire input feature-maps are accessed. Hence, to generate all output feature-maps, the input feature-maps are repeatedly read M/To times. Because the entire input feature-maps are read multiple times, the temporary buffers must be large enough to store them. Moreover, to pipeline between layers, the buffer size should be doubled, which is 2×H²×N×QA.

In the proposed line-based scheme in Fig. 2(c), the Ti-channel input sliding cube is convolved with To weight blocks each time to produce To temporary output values. These weight blocks are reused for a row pass. These To computations are processed in parallel and saved in the line buffers, thereby creating To temporary output channels. The input sliding cube then shifts Ti channels toward the end of the N input channels. In the next row pass, new To weight blocks are fetched and convolved with the sliding cube. The convolutional outputs are accumulated with the corresponding values from the line buffers and then saved back to the line buffers. This procedure is repeated for every line of the input feature-maps; hence, the weights are read H times. As the weights are stored in on-chip SRAM and weight prefetching can be applied to hide the latency, the weight accesses do not cause system degradation. The computation of a convolutional layer is completed when all lines are processed. It is noteworthy that each row of the input feature-maps is reused K times. Regarding the hardware resource, the input buffer size for pipelining is (K+1)×N×H×QA, and the temporary accumulation buffer size is To×H×QS.

The partial output from a layer is input directly to the next layer without going back to external memory. In addition, it is noteworthy that the partial output parameter To of a layer is the partial input parameter Ti of the next layer to minimize the cost of the control logic and Block RAM banks. The parallelism parameters such as Ti and To are chosen to achieve the best trade-off between hardware cost and performance.

Table I summarizes the three scheduling schemes. It should be noted that QA is much smaller than QS after quantization. As analyzed above, the first strategy (i.e., no weight reuse) is not efficient enough owing to the frequent weight reads and the lack of input reuse (i.e., no overlapped sliding windows). The second scheme (i.e., full weight reuse) has the best weight reuse among the three schemes, but it operates with frame-wise computation, which requires a large inter-layer double buffer for pipelining. In batch mode, the buffer and convolution kernel increase linearly as the batch size increases. The next layer can start only after the entire output of the previous layer is computed. On the other hand, the proposed scheme requires smaller buffers and causes a smaller delay between layers (i.e., a line delay). Weight prefetching can be used to hide the latency of the weight reads. It is also noteworthy that the proposed scheme does not incur a hardware resource overhead in batch mode. Therefore, the proposed line-based weight reuse scheme outperforms the other schemes in terms of both hardware cost and performance.
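The buffer and weight-access trade-offs of the three schemes (collected in Table I below) can be tabulated for a concrete layer with a short script. The expressions are those of Table I; the example layer parameters follow CONV2 of Table IV, while the accumulator width QS = 24 bits is an assumed value, as it is not specified in this excerpt.

```python
def scheme_costs(H, N, K, To, QA, QS):
    """Buffer sizes (bits) and weight-read counts of the three scheduling
    schemes, using the expressions collected in Table I below."""
    return {
        "no weight reuse":   {"input_buf":  (K + 1) * N * H * QA,
                              "output_buf": 0,
                              "weight_reads": H ** 2},
        "full weight reuse": {"input_buf":  2 * H ** 2 * N * QA,
                              "output_buf": 2 * To * H ** 2 * QS,
                              "weight_reads": 1},
        "line-based (ours)": {"input_buf":  (K + 1) * N * H * QA,
                              "output_buf": To * H * QS,
                              "weight_reads": H},
    }

# Example layer: H = 208, N = 32, K = 3, To = 8, 6-bit activations (QA),
# and an assumed 24-bit accumulation width (QS).
for name, c in scheme_costs(H=208, N=32, K=3, To=8, QA=6, QS=24).items():
    kb = (c["input_buf"] + c["output_buf"]) / 8 / 1024
    print(f"{name:18s} ~{kb:9.1f} KB on-chip buffer, {c['weight_reads']} weight reads")
```

For this layer, the line-based scheme needs only a few tens of kilobytes of on-chip buffer, whereas the full-weight-reuse scheme needs several megabytes, which matches the scaling argument made above.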
TABLE I
COMPARISON OF THE THREE SCHEDULING SCHEMES FOR STREAMING PROCESSING

Features             | No weight reuse | Full weight reuse | Line-based weight reuse (proposed)
Input buffer size    | (K+1)×N×H×QA    | 2×H²×N×QA         | (K+1)×N×H×QA
Output buffer size   | 0               | 2×To×H²×QS        | To×H×QS
Weight read (times)  | H²              | 1                 | H
Weight reuse (times) | 1               | H²                | H
Relative latency     | 1               | H²                | H

[…] PE. In each PE, the Ti 3×3 input data are convolved with the corresponding 1-bit 3×3 weight kernels as illustrated in Fig. 4(b). Then, the 3×3 results are summed up using two stages of ternary adders (i.e., 3-input adders). The results from the Ti kernels are input to a pipelined adder tree. The output data from the adder tree are saved temporarily to buffers. In the next iteration, the next Ti input channels are sent to the PEs, the outputs from the adder tree are summed up with the corresponding values in the buffers, and then the resulting values are saved back to the buffers. This iteration is completed once all N input channels have been sent to the PEs, and the values in the temporary buffer are then the final convolutional values. The computation for each line is completed when all M output channels are calculated. This input line is no longer needed; hence, it can be replaced by the next line. Thus, this feature-map reuse scheme does not require writing back to DRAM, and consequently, the memory bandwidth can be reduced.

Fig. 4. The streaming design of a convolutional layer. (a) The architecture of a 3×3 convolutional layer. (b) The design for the 1-bit weight and low-bit activation 3×3 kernel.
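The block-based accumulation just described can be modeled functionally with a few lines of NumPy: Ti channels are consumed per pass, To output channels are produced in parallel (sequentially in this software model), and partial sums are accumulated in per-line buffers. This is a sketch of the arithmetic only, not the pipelined RTL; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def conv_line_streaming(in_lines, weights, Ti, To):
    """Functional sketch of the block-based 3x3 convolution for ONE output line.

    in_lines : (N, 3, W+2) array  - three buffered input rows, padded width
    weights  : (M, N, 3, 3) array - binary (+1/-1) kernels
    Returns  : (M, W) array of pre-quantization accumulations (the line buffers).
    """
    N, _, Wp = in_lines.shape
    M, W = weights.shape[0], Wp - 2
    out = np.zeros((M, W))                       # line buffers, M output channels
    for mo in range(M // To):                    # To output channels per pass
        for ni in range(N // Ti):                # accumulate over N/Ti partial passes
            x = in_lines[ni * Ti:(ni + 1) * Ti]  # Ti input channels (sliding cube)
            for t in range(To):                  # To PEs work in parallel in hardware
                w = weights[mo * To + t, ni * Ti:(ni + 1) * Ti]
                for col in range(W):             # slide the 3x3 window along the row
                    patch = x[:, :, col:col + 3]
                    # with binary weights the "multiply" is only a sign selection
                    out[mo * To + t, col] += np.sum(patch * w)
    return out
```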
...
For efficient prefetching of the weights to speed up the convolutional computation, the weights in the Block RAMs need to be stored in a predefined pattern. Fig. 6 describes the proposed weight memory pattern to support the partial convolution. Each weight block contains the weights for To PEs, each of which computes the partial convolution for Ti input channels. The weight blocks are stored sequentially in memory according to the processing order. Weights from memory are loaded to the weight buffer block-by-block at consecutive addresses. This weight block is reused for each line of the Ti input channels. To produce one line of the final output, the whole weight set for that layer is loaded. Hence, each layer needs to load its weights an image-width number of times to process the entire image. For layers with a large number of kernels, this does not incur many weight reloads from the Block RAMs, because the width of the input image is rather small.

Fig. 5. The memory storage for a line buffer composed of multiple partial lines.

Fig. 6. Weight memory pattern for efficient weight prefetching.

C. Streaming Design of the Max-pooling Layer

The streaming design of the 2×2 partial-input max-pooling layer is depicted in Fig. 7. It is noteworthy that this layer requires only one line buffer with a depth of half the width of the input image. The partial input from the convolutional layer is latched, and the latched input is then compared to the current input. If the current row is even (assuming that rows are counted from zero), the comparison results are stored in the line buffer. Otherwise (i.e., the current row is odd), the comparison results are compared one more time to the value at the corresponding address in the buffer. The final comparison results are concatenated to produce the Ti-channel outputs.
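The streaming 2×2 max-pooling with a single half-width line buffer can be sketched as below; it is a functional illustration of the scheme described above (one channel shown), not the RTL, and the names are ours.

```python
import numpy as np

def streaming_maxpool_2x2(rows):
    """Sketch of the streaming 2x2, stride-2 max-pooling described above.

    rows: iterable of 1-D arrays (one input line at a time, even width W).
    Yields one pooled line of width W//2 for every two input lines, using only
    a single line buffer of depth W//2.
    """
    line_buf = None
    for r, row in enumerate(rows):
        # horizontal 2-to-1 max: compare each latched sample with the next input
        horiz = np.maximum(row[0::2], row[1::2])
        if r % 2 == 0:                 # even row: store the partial results
            line_buf = horiz
        else:                          # odd row: compare once more and emit
            yield np.maximum(line_buf, horiz)

# Example: two 8-pixel rows produce one 4-pixel pooled row.
rows = [np.array([1, 3, 2, 2, 5, 0, 1, 1]), np.array([4, 0, 1, 6, 2, 2, 0, 7])]
print(list(streaming_maxpool_2x2(rows)))   # [array([4, 6, 5, 7])]
```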
The parallelism factors Ti and To of each layer should be chosen to achieve the best performance of the streaming architecture in terms of both throughput and hardware cost. As shown in Fig. 2, with N-channel inputs producing M-channel outputs, the number of repetitions of the partial computation and accumulation to produce a final output is Ni = N/Ti. In addition, To output channels are computed in parallel. Thus, the number of repetitions required to compute all output channels is Mo = M/To. It is noteworthy that Ti and To are divisors of N and M, respectively, to avoid a complicated design and underutilization of the computation kernels.

The delay of each block-based computation for each kernel size (i.e., 1×1, 3×3) is given below:

t_1×1 = 8 + log(Ti)    (12)

t_3×3 = 10 + log(Ti)    (13)

Here, the specific numbers such as 8 and 10 are present owing to certain pipeline stages in the convolutional kernel. Hence, the computation time per line for each layer (depending on the kernel size) is given as follows:

T_1×1 = (8 + log(Ti)) × C × (N/Ti) × (M/To)    (14)

T_3×3 = (10 + log(Ti)) × C × (N/Ti) × (M/To)    (15)

where C is the width of the input image. It should be noted that the computation time varies for each group of layers. For example, owing to the 2×2 max-pooling layer, 2 lines are computed in CONV1 to produce 1 line for CONV2. Similarly, CONV2 needs to compute 2 lines to produce 1 line for CONV3, and so on. To achieve the maximum efficiency of pipeline processing, the computation time for each layer should be similar. It is noteworthy that the Ti parameter of the current layer is the To value of the previous layer. Therefore, the parallelism factors of two consecutive convolutional layers (denoted as layers 1 and 2 below) that belong to different groups should satisfy the following condition:

2 × (10 + log(Ti1)) × C1 × (N1/Ti1) × (M1/To1) ≈ (10 + log(Ti2)) × C2 × (N2/Ti2) × (M2/To2)    (16)

[…] Owing to the binary weights, the number of DSPs in each layer except the last layer is only To. As the last layer requires multiplications of the weights and input feature-maps, its number of DSPs is Ti×To + To. Given the number of available DSPs, the parallelism parameters of all layers must satisfy the following condition:

To^(n) × To^(n−1) + Σ_{i=1..n} To^(i) ≤ DSPs_available    (19)

where n is the number of layers, and To^(i) is the To value of the i-th layer. For each PE in Fig. 4, there are 4×Ti ternary adders and Ti+1 binary adders. Therefore, in the convolutional kernel of each layer, there are To×(5×Ti+1) adders. It is noteworthy that the ternary adder is implemented efficiently using the LUT6 and the carry chain in the same slice of the Virtex-7 FPGA chip, thereby saving area and achieving a high frequency [26]. Consequently, it requires only two cycles to reduce nine inputs to a single output while guaranteeing a high speed of 200 MHz.

E. Batch Processing

As shown in Fig. 8, the streaming architecture with pipelining enables the computation to run in batch mode to fully utilize the pipeline processing. Each layer is delayed by 1-to-4 input lines from the previous layer. Because the network is very deep, the delay from the first layer to the last layer becomes large. If a single frame is processed at a time, the first layers are underutilized for a definite period. The idea of batch processing is that, while the layers at the end of the network run the first image, the first layers, which have finished running the first image, can start running the second image to utilize the idle period. The batch processing mode increases the throughput significantly.

To define the processing time and speedup, T, D, L, and n are defined as follows: T is the frame processing time of the last layer, D is the delay from the first layer to the last layer, L is the latency between the last layers of two consecutive frames, and n is the batch size. The processing time of a frame in normal mode is T+D. The processing time for a batch of size n is D + n×T + (n−1)×L. The speedup can be derived as below:

speedup = (T + D) / (T + L + (D − L)/n)    (20)
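As an illustration of the per-line timing model in (12)-(16) above, the short script below evaluates the per-line cycle counts of two consecutive 3×3 layers of Table IV that sit on either side of a 2×2 max-pooling layer. Taking log() as base 2 is our assumption (it matches the adder-tree depth reading of the pipeline stages).

```python
import math

def line_cycles(kernel_size, C, N, M, Ti, To):
    """Computation time per output line, cf. (14)-(15); the constants 8/10
    follow the text above, and log is taken as log2 (an assumption)."""
    base = 8 if kernel_size == 1 else 10
    return (base + math.log2(Ti)) * C * (N / Ti) * (M / To)

# CONV2 (208x208, N=32, M=64, Ti=To=8) and CONV4 (104x104, N=64, M=128, Ti=To=8)
# from Table IV, separated by a 2x2 max-pooling layer:
t_conv2 = line_cycles(3, C=208, N=32, M=64, Ti=8, To=8)
t_conv4 = line_cycles(3, C=104, N=64, M=128, Ti=8, To=8)
print(2 * t_conv2, t_conv4)   # 173056.0 173056.0 -> balance condition (16) holds
```

With the parallelism factors listed in Table IV, two lines of CONV2 take exactly as long as one line of CONV4, which is the balanced-pipeline condition that (16) expresses.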
Fig. 8. Batch processing.

TABLE II
ACCURACY (MEAN AVERAGE PRECISION) OF THE QUANTIZED TINY YOLO CNN (W: WEIGHT, A: ACTIVATION)

Quantization              | Accuracy (%) | Weight size (MB) | Activation size (MB) | Multiplication for convolution
Baseline (32-b W, 32-b A) | 53.96        | 60.53            | 22.0                 | Yes
1-b W, 6-b A              | 51.44        | 2.01             | 4.1                  | No
1-b W, 4-b A              | 49.12        | 2.01             | 2.73                 | No
1-b W, 3-b A              | 45.0         | 2.01             | 2.05                 | No

TABLE III
ACCURACY (MEAN AVERAGE PRECISION) OF THE QUANTIZED YOLO-V2 AND SIM-YOLO-V2

Networks             | Quantization     | Accuracy (%) | Weight size (MB) | Complexity (GOP)
(1) YOLO-v2          | Full precision   | 75.88        | 258              | 34.9
(1) YOLO-v2          | 1-b W, 32-b A    | 71.56        | 8.1              | 17.45
(1) YOLO-v2          | 1-b W, 6-b A     | 71.11        | 8.1              | 17.45
(2) Sim-YOLO-v2      | Full precision   | 72.0         | 79.74            | 18.95
(2) Sim-YOLO-v2      | 1-b W, 32-b A    | 66.99        | 2.54             | 9.48
(2) Sim-YOLO-v2      | 1-b W, 6-b A     | 65.76        | 2.54             | 9.48
(3) Sim-YOLO-v2 FPGA | Full precision   | 66.79        | 58.28            | 17.18
(3) Sim-YOLO-v2 FPGA | 1-b W, 32-b A    | 64.95        | 1.88             | 8.59
(3) Sim-YOLO-v2 FPGA | 1-b W, 6-b A     | 65.07        | 1.88             | 8.59
(3) Sim-YOLO-v2 FPGA | 1-b W, 4-to-6-b A| 64.16        | 1.88             | 8.59

TABLE IV
YOLO-V2 NETWORK ARCHITECTURE AND PARALLELISM FACTORS FOR EACH LAYER

Layer | Type | Size / Stride | Filter number | Input Size   | Output Size  | Out bit width | PF (Ti, To)
0     | C    | 3×3 / 1       | 32            | 416×416×3    | 416×416×32   | 6             | (3, 32)
1     | M    | 2×2 / 2       | –             | 416×416×32   | 208×208×32   | 6             | (8, 8)
2     | C    | 3×3 / 1       | 64            | 208×208×32   | 208×208×64   | 6             | (8, 8)
3     | M    | 2×2 / 2       | –             | 208×208×64   | 104×104×64   | 6             | N/A
4     | C    | 3×3 / 1       | 128           | 104×104×64   | 104×104×128  | 6             | (8, 8)
5     | C    | 1×1 / 1       | 64            | 104×104×128  | 104×104×64   | 6             | (8, 8)
6     | C    | 3×3 / 1       | 128           | 104×104×64   | 104×104×128  | 6             | (8, 8)
7     | M    | 2×2 / 2       | –             | 104×104×128  | 52×52×128    | 6             | N/A
8     | C    | 3×3 / 1       | 256           | 52×52×128    | 52×52×256    | 6             | (8, 8)
9     | C    | 1×1 / 1       | 128           | 52×52×256    | 52×52×128    | 6             | (8, 8)
10    | C    | 3×3 / 1       | 256           | 52×52×128    | 52×52×256    | 6             | (8, 16)
11    | M    | 2×2 / 2       | –             | 52×52×256    | 26×26×256    | 6             | N/A
12    | C    | 3×3 / 1       | 512           | 26×26×256    | 26×26×512    | 6             | (16, 8)
13    | C    | 1×1 / 1       | 256           | 26×26×512    | 26×26×256    | 6             | (8, 16)
14    | C    | 3×3 / 1       | 512           | 26×26×256    | 26×26×512    | 6             | (16, 8)
15    | C    | 1×1 / 1       | 256           | 26×26×512    | 26×26×256    | 6             | (8, 16)
16    | C    | 3×3 / 1       | 512           | 26×26×256    | 26×26×512    | 4             | (16, 8)
17    | M    | 2×2 / 2       | –             | 26×26×512    | 13×13×512    | 4             | N/A
18    | C    | 3×3 / 1       | 1024          | 13×13×512    | 13×13×1024   | 6             | (8, 16)
19    | C    | 1×1 / 1       | 512           | 13×13×1024   | 13×13×512    | 4             | (16, 8)
20    | C    | 3×3 / 1       | 1024          | 13×13×512    | 13×13×1024   | 6             | (8, 16)
21    | C    | 1×1 / 1       | 125           | 13×13×1024   | 13×13×125    | 16            | (16, 5)
Note: C = Convolution, M = Maxpool, PF = Parallelism Factors.

The model size is reduced by 30×, and the activation size is reduced by 5.4×.

To justify the efficiency of the proposed streaming architecture in computing deeper networks, the quantization of a deeper network is performed. This study selects a simplified version of YOLO-v2 as a baseline for the quantization. This network, called Sim-YOLO-v2, inherits the structure of YOLO-v2 except for the last layers for multi-scale detection [3]. As the research in [3] points out that the pass-through layers, which fetch features from an earlier layer to the final outputs, enable only a modest increase in performance (i.e., 1%), these pass-through layers are removed to save the computation budget. Hence, Sim-YOLO-v2 contains 19 convolution layers and 5 max-pooling layers. Its architecture is the same as that of the Darknet-19 network in [3]. Table III presents the accuracy of the quantized networks of YOLO-v2 and Sim-YOLO-v2 as compared to the full-precision networks.

The binary versions of these networks attain an accuracy comparable with that of the full-precision networks, while saving a significant amount of computation. These quantized networks are used for the verification of the streaming architecture. It is noteworthy that the deeper networks perform better in terms of quantization. The binary network numbered (3), which has 17 convolution layers and 5 max-pooling layers (2 CONV layers removed from Sim-YOLO-v2 (2)), achieves an accuracy of 64.16%, which is 12.78% higher than that of the quantized Tiny YOLO-v2 in Table II. In addition, its binary weight size is only 1.88 MB, and therefore, it can be stored entirely in the Block RAMs of an FPGA. Even though it is three times more complex than Tiny YOLO-v2, a high throughput is achieved by virtue of the streaming architecture. To provide detailed information on the network topology, the architecture of the implemented network is described in Table IV.

B. Implementation Result

The analysis in Section IV-D defines the guidelines to achieve balanced pipelining. The parallelism factors for each layer are empirically chosen according to these guidelines. These factors satisfy the FPGA resource constraints and the target operating frequency (i.e., 200 MHz). The detailed parallelism factors for each layer are listed in Table IV.

The framework in Fig. 3 is used to verify the operation of the proposed design. Input images and control commands from the host PC are sent to the accelerator through the PCI Express port. After the computation, the detection results are sent back to the host PC for post-processing. Fig. 9 demonstrates the real-time object detection of the YOLO network using the proposed design on the VC707 evaluation board. The detector can detect the 20 object classes of the PASCAL VOC dataset at 30 fps.

The batch processing increases the throughput significantly as it improves the utilization of each convolutional layer. The experiments at 200 MHz show that T = 7.5 ms, D = 8.975 ms, and L = 1.43 ms. According to equation (20), the dependence of the speedup on the batch size is illustrated in Fig. 10. The speedup almost saturates at 1.8 with a batch size of 30. Table V shows the performance of the proposed hardware design with […]
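As a quick check of (20) against the figures reported above (T = 7.5 ms, D = 8.975 ms, L = 1.43 ms), the following snippet evaluates the batch-mode speedup; it reproduces the saturation at roughly 1.8 for a batch size of 30.

```python
def batch_speedup(T, D, L, n):
    """Speedup of batch processing over single-frame processing, cf. (20)."""
    return (T + D) / (T + L + (D - L) / n)

for n in (1, 2, 4, 8, 16, 30):
    print(n, round(batch_speedup(T=7.5, D=8.975, L=1.43, n=n), 2))
# batch size 30 gives ~1.79, consistent with the reported saturation near 1.8
```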
TABLE VI
COMPARISON OF THE PROPOSED DESIGN WITH THE PREVIOUS WORKS FOR YOLO CNN HARDWARE

Metric                      | Sim-YOLO-v2 on GPU [3] | Tincy YOLO [21]          | Lightweight YOLO-v2 [22] | This work (Tiny YOLO-v2) | This work (Sim-YOLO-v2)
Platform                    | GTX Titan X (16 nm)    | Zynq Ultrascale+ (16 nm) | Zynq Ultrascale+ (16 nm) | Virtex-7 VC707 (28 nm)   | Virtex-7 VC707 (28 nm)
Frequency                   | 1 GHz                  | N/A                      | 300 MHz                  | 200 MHz                  | 200 MHz
BRAMs (18 Kb)               | N/A                    | N/A                      | 1706                     | 1026                     | 1144
DSPs                        | N/A                    | N/A                      | 377                      | 168                      | 272
LUTs – FFs                  | N/A                    | N/A                      | 135K – 370K              | 86K – 60K                | 155K – 115K
CNN Size (GOP)              | 17.18                  | 4.5                      | 14.97                    | 6.97                     | 17.18
Precision (W, A) (**)       | (32, 32)               | (1, 3)                   | (1-32, 1-32)             | (1, 6)                   | (1, 6)
Image Size                  | 416×416                | 416×416                  | 224×224                  | 416×416                  | 416×416
Frame rate (fps)            | 88                     | 16                       | 40.81                    | 66.56                    | 109.3
Accuracy (mAP) (%)          | 66.79                  | 48.5                     | 67.6                     | 51.38                    | 64.16
Throughput (GOPS)           | 1512                   | 72                       | 610.9                    | 464.7                    | 1877
Efficiency (GOPS/kLUT)      | N/A                    | N/A                      | 4.52                     | 5.40                     | 12.11
Power (W)                   | 170                    | 6                        | N/A                      | 8.7                      | 18.29
Power efficiency (GOP/s/W)  | 8.89                   | 12                       | N/A                      | 53.29                    | 102.62
Note: (**) W: Weight, A: Activation.

TABLE VII
COMPARISON OF THE PROPOSED DESIGN WITH THE PREVIOUS WORKS FOR OTHER CNN HARDWARE

Metric                      | FPGA'17 [19]  | HiPEAC'17 [30]            | Neurocomputing [20] | FCCM'17 [18]     | TVLSI'18 [29]           | This work
Platform                    | Zynq XC7Z045  | Kintex Ultrascale XCKU115 | Stratix-V 5SGSD8    | Zynq Ultrascale+ | Intel Arria 10 GX 1150  | Virtex-7 VC707
Frequency (MHz)             | 200           | 125                       | 150                 | 200              | 200                     | 200
BRAMs (18 Kb)               | 186           | 1814                      | 2210 (*)            | 1824             | 2232 (*)                | 1214
DSPs                        | N/A           | N/A                       | 384                 | 2520             | 1518                    | 272
LUTs                        | 46.3K         | 392.9K                    | 230.9K (*)          | 600K             | 138K (*)                | 104.7K
FFs                         | N/A           | 348K                      | N/A                 | N/A              | N/A                     | 140.1K
CNN Size (GOP)              | 0.1125        | 1.2                       | 1.45                | 5 layers of VGG  | 30.95                   | 30.74
Precision (W, A)            | (1, 1)        | (1, 1)                    | (1, 1)              | (16, 16)         | (16, 16)                | (1, 2)
Image Size                  | 32×32         | 32×32                     | 224×224             | 224×224          | 224×224                 | 224×224
Throughput (GOPS)           | 2463.8        | 14814                     | 1964.0              | 2940.7           | 715.9                   | 4420
Efficiency (GOPS/kLUT)      | 53.2          | 37.7                      | 8.51                | 4.90             | 5.19                    | 42.22
Power (W)                   | 11.7          | N/A                       | 26.2                | 23.6             | N/A                     | 14.72
Power efficiency (GOP/s/W)  | 210.58        | N/A                       | 74.96               | 124.6            | N/A                     | 300.27
Note: (*) For the Intel FPGAs, the values are Block RAMs (20 Kb) and logic cells (ALMs).
[8] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," [Online]. Available: arxiv.org/abs/1603.05279.
[9] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights," [Online]. Available: arxiv.org/abs/1702.03044.
[10] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," [Online]. Available: arxiv.org/abs/1602.02830.
[11] F. Sun, C. Wang, L. Cong, C. Xu, Y. Zhang, Y. Lu, X. Li, and X. Zhou, "A High-Performance Accelerator for Large-Scale Convolutional Neural Networks," in Proc. Int. Conf. Parallel Distributed Process. Appl., 2017.
[12] Y. Shen, M. Ferdman, and P. Milder, "Maximizing CNN Accelerator Efficiency Through Resource Partitioning," in Proc. Int. Symp. Comput. Archit., 2017, pp. 535-547.
[13] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs," in Proc. Design Autom. Conf., 2017.
[14] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A High Performance FPGA-Based Accelerator for Large-Scale Convolutional Neural Networks," in Proc. Int. Conf. Field-Program. Logic Appl., 2016.
[15] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "An Automatic RTL Compiler for High-Throughput FPGA Implementation of Diverse Deep Convolutional Neural Networks," in Proc. Int. Conf. Field-Program. Logic Appl., 2017.
[16] S. Winograd, Arithmetic Complexity of Computations, vol. 33. SIAM, 1980.
[17] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL Deep Learning Accelerator on Arria 10," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2017, pp. 55-64.
[18] L. Lu, Y. Liang, Q. Xiao, and S. Yan, "Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs," in Proc. IEEE Int. Symp. Field-Program. Custom Comput. Mach., 2017.
[19] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2017, pp. 65-74.
[20] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, "FP-BNN: Binarized Neural Network on FPGA," Neurocomputing, vol. 275, pp. 1072-1086, 2018.
[21] T. B. Preußer, G. Gambardella, N. Fraser, and M. Blott, "Inference of Quantized Neural Networks on Heterogeneous All-Programmable Devices," in Proc. IEEE Design, Autom. Test Eur. Conf., 2018.
[22] H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, "A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2018, pp. 31-40.
[23] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448-456.
[24] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, "Fixed Point Quantization of Deep Convolutional Networks," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2849-2858.
[25] Y.-H. Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[26] 7 Series DSP48E1 Slice, Xilinx Inc., 2018.
[27] Darknet Deep Learning Framework. [Online]. Available: https://github.com/pjreddie/darknet.
[28] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," [Online]. Available: arxiv.org/abs/1409.1556.
[29] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354-1367, 2018.
[30] N. J. Fraser, Y. Umuroglu, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "Scaling Binarized Neural Networks on Reconfigurable Logic," [Online]. Available: arxiv.org/abs/1701.03400.
[31] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren, "A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks," ACM J. Emerg. Technol. Comput. Syst., vol. 14, no. 2, Art. 18, 2018.
[32] Zynq UltraScale+ MPSoC Data Sheet, Xilinx Inc., 2018.

Duy Thanh Nguyen received the B.S. degree in electrical engineering from Hanoi University of Science and Technology, Hanoi, Vietnam, and the M.S. degree in electrical and computer engineering from Seoul National University, Seoul, Korea, in 2011 and 2014, respectively. He is currently working toward the Ph.D. degree in electrical and computer engineering at Seoul National University. His research interests include computer architecture, memory systems, and SoC design for computer vision applications.

Tuan Nghia Nguyen received the B.S. degree in electronics and telecommunications from Hanoi University of Science and Technology, Hanoi, Vietnam. He is currently working toward the M.S. degree in electrical and computer engineering at Seoul National University, Seoul, Korea. His research interests are computer vision, deep learning applications, and computer architecture.

Hyun Kim received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 2009, 2011, and 2015, respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development for IT, Seoul National University, Seoul, Korea, as a BK Assistant Professor. In 2018, he joined the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul, Korea, where he is currently working as an Assistant Professor. His research interests are in the areas of algorithm, computer architecture, memory, and SoC design for low-complexity multimedia applications and deep neural networks.

Hyuk-Jae Lee received the B.S. and M.S. degrees in electronics engineering from Seoul National University, South Korea, in 1987 and 1989, respectively, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, in 1996. From 1996 to 1998, he was with the faculty of the Department of Computer Science, Louisiana Tech University, Ruston, LA. From 1998 to 2001, he was with the Server and Workstation Chipset Division, Intel Corporation, Hillsboro, OR, as a Senior Component Design Engineer. In 2001, he joined the School of Electrical Engineering and Computer Science, Seoul National University, South Korea, where he is currently a Professor. He is a founder of Mamurian Design, Inc., a fabless SoC design house for multimedia applications. His research interests are in the areas of computer architecture and SoC design for multimedia applications.