
PipeCNN: An OpenCL-Based Open-Source FPGA Accelerator for Convolution Neural Networks


Dong Wang, Ke Xu and Diankun Jiang
Institute of Information Science
Beijing Jiaotong University
Beijing 100044, China
Email: {wangdong, 17112071, 16125141}@bjtu.edu.cn

Abstract—Convolutional neural networks (CNNs) have been employed in many applications, such as image classification, video analysis and speech recognition. Being compute-intensive, CNNs are widely accelerated by GPUs at the cost of high power dissipation. Recently, studies have exploited FPGAs as CNN accelerators because of their reconfigurability and energy-efficiency advantage over GPUs, especially now that OpenCL-based high-level synthesis tools are available and provide fast verification and implementation flows. In this paper, we demonstrate PipeCNN, an efficient FPGA accelerator that can be implemented on a variety of FPGA platforms with reconfigurable performance and cost. The PipeCNN project is openly accessible, and thus can be used either by researchers as a generic framework to explore new hardware architectures or by teachers as an off-the-shelf design example for academic courses related to FPGAs.

Fig. 1. The top-level architecture of PipeCNN: deeply pipelined OpenCL kernels (MemRD, Conv., Pooling, MemWR, LRN) connected by channels/pipes and accessing global memory; NDRange and single-threaded kernels are distinguished in the legend.

I. INTRODUCTION


Convolutional neural network (CNN) [1], [2], as an emerging deep learning architecture, has received huge attention in various applications, such as video surveillance, image searching, speech recognition, and robot vision. Currently, GPUs are widely adopted as hardware accelerators for training deep neural networks. Yet they are generally energy-inefficient for embedded applications. FPGAs, which provide massive processing elements, reconfigurable interconnections and lower power dissipation, are naturally suitable for implementing neural network circuits. Moreover, FPGAs are also flexible with reduced data precision at the circuit level, which lowers the memory footprint and bandwidth requirements, resulting in better energy efficiency than GPUs.

Studies such as [4], [5] have reported efficient CNN accelerators on embedded FPGA platforms. However, the traditional register-transfer-level (RTL) design flows adopted in these studies require deep background knowledge in digital circuit design and great effort in writing complex RTL code and running time-consuming simulations and compilations before one can actually run an accelerator on hardware. Given the rapid development of deep learning, these unfriendly features of RTL-based design hinder domain experts from utilizing FPGAs to explore new architectures for neural network accelerators.

High-Level Synthesis (HLS) tools, which enable automatic compilation from high-level programs (C/C++) to low-level RTL specifications, have become increasingly popular in both academia and industry. Compared with the traditional methodology, HLS tools provide a faster hardware development cycle and software-friendly program interfaces that can be easily integrated with user applications [3].

In this paper, we introduce PipeCNN, an efficient OpenCL-based CNN accelerator on FPGAs. A set of configurable OpenCL kernels is designed to accelerate a wide range of neural network models. Throughput and memory bandwidth optimization schemes are also presented and discussed. All the design files are openly accessible and can be downloaded from [6].

In the final demo, PipeCNN was implemented and evaluated on three different FPGA platforms, including Cyclone-V SEA5 SoC, Stratix-V GXA7 and Arria-10 AX115. CNN-based image classification applications were accelerated by PipeCNN. The processing speed and power consumption were measured and demonstrated at runtime, showing scalable performance and cost that can meet different application requirements and resource constraints.

Ke Xu ([email protected]) is the corresponding author of this paper. This work was partially supported by NNSF of China Grant No. 61574013.



II. ARCHITECTURE DESIGN AND OPTIMIZATION

A. Accelerator Architecture

Fig. 2. The hardware architecture of the convolution kernel: vectorized input features and weights feed a pipelined multiplier-adder tree (width VEC_SIZE) with a delayed buffer and an output buffer, and the structure is replicated across CU_NUM compute units.

Fig. 3. Data and work-item mapping scheme of the data mover kernels: the MemRD kernel NDRange covers the input feature maps and the MemWR kernel NDRange covers the output feature maps of size (W-K)/S+1.

As shown in Fig. 1, PipeCNN consists of a group of OpenCL kernels that are cascaded by using Altera's OpenCL extension Channels. Two data mover kernels, namely MemRD and MemWR, transfer feature map and weight data from/to the global memory (i.e., the external DDR memory), feeding the other kernels with high-throughput data streams. The Convolution kernel (Conv.) is designed to accelerate the most compute-intensive computations in CNNs, i.e., the convolution layers and the FC layers. The Pooling kernel performs subsampling operations directly on the output data stream of the Conv. kernel. The Local Response Normalization (LRN) kernel fetches data from global memory and performs normalization on the feature maps of neighboring neurons [1]. This architecture has the following advantages: 1) the cascaded kernels form a deep pipeline, which can execute a series of basic CNN operations without the need to store interlayer data back to external memory, significantly relieving the demand on memory bandwidth, which is essential for embedded FPGAs; 2) we use a single hardware kernel to implement both the convolution and FC layers, which further improves the efficiency of hardware resource utilization.
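To make the channel-based cascade concrete, the following minimal OpenCL sketch shows how one kernel can stream results directly to the next without a round trip through DDR. It assumes Altera's channels extension (as provided by the OpenCL SDK mentioned later in this paper); the channel and kernel names are illustrative, not the actual PipeCNN sources.

```c
// Minimal sketch of a channel-based kernel cascade (hypothetical names).
// Requires the Altera/Intel FPGA OpenCL channels extension.
#pragma OPENCL EXTENSION cl_altera_channels : enable

// FIFO connecting a producer stage (e.g., convolution) to a consumer
// stage (e.g., pooling); the depth is a tuning parameter.
channel float conv_out_ch __attribute__((depth(64)));

__kernel void producer(__global const float *restrict in, int n) {
    for (int i = 0; i < n; i++) {
        float v = in[i] * 2.0f;                 // stand-in for real convolution work
        write_channel_altera(conv_out_ch, v);   // stream the result downstream
    }
}

__kernel void consumer(__global float *restrict out, int n) {
    for (int i = 0; i < n; i++) {
        // Blocking read keeps the two kernels synchronized without
        // storing intermediate data in external DDR memory.
        out[i] = read_channel_altera(conv_out_ch);
    }
}
```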
Detailed kernel designs and the corresponding optimization schemes are as follows:

1) Convolution Kernel: The convolution operation is essentially a 3-dimensional (3-D) multiply-accumulate (MAC) operation that can be defined by

    D_o(f_o, y, x) = \sum_{f_i=1}^{C_l} \sum_{k_y=0}^{K-1} \sum_{k_x=0}^{K-1} W_l(f_o, f_i, k_y, k_x) \cdot D_i(f_i, y + k_y, x + k_x)    (1)

where D_i(f_i, y, x) and D_o(f_o, y, x) denote the neurons at position (x, y) in the input feature map f_i and the output feature map f_o, respectively. W_l(f_o, f_i, k_y, k_x) represents the corresponding weights in the l-th layer that get convolved with f_i. The size of the convolution filters is K × K, while the total number of input feature maps is C_l. In this paper, we propose to implement (1) by using an HLS-friendly 1-D convolution structure which flattens the 3-D convolution as follows:

    D_o(f_o) = \sum_{f'_i=1}^{C_l \times K \times K} W'_l(f_o, f'_i) \cdot D'_i(f'_i)    (2)

In this way, nested loops can be avoided in the kernel code, and an efficient convolution pipeline structure consisting of a multiplier-adder tree with a delayed buffer is generated by the compiler, as Fig. 2 shows. When an appropriate buffer depth is selected, the proposed structure can be efficiently pipelined by the OpenCL compiler with an initiation interval of only one clock cycle. Each convolution pipeline constitutes a compute unit (CU), and the kernel consists of multiple CUs to perform parallel convolutions.
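As an illustration of Eq. (2), the fragment below sketches the flattened 1-D MAC loop for one output neuron. It is a simplified scalar version (the real kernel is vectorized and uses a multiplier-adder tree); the shift register plays the role of the delayed buffer that lets the HLS compiler pipeline the accumulation with an initiation interval of one. The function name and DEPTH value are illustrative assumptions.

```c
// Sketch of the flattened 1-D MAC loop of Eq. (2) for one output neuron.
// DEPTH is the depth of the delay (shift-register) buffer that breaks the
// loop-carried dependency on the accumulator so the compiler can reach II = 1.
#define DEPTH 8

float conv_1d(__global const float *restrict weight,   // flattened W'_l(f_o, .)
              __global const float *restrict feature,  // flattened D'_i(.)
              int len)                                  // len = C_l * K * K
{
    float shift_reg[DEPTH + 1];                         // delayed buffer
    #pragma unroll
    for (int i = 0; i < DEPTH + 1; i++)
        shift_reg[i] = 0.0f;

    for (int i = 0; i < len; i++) {
        // One multiply-accumulate per clock cycle once pipelined.
        shift_reg[DEPTH] = shift_reg[0] + weight[i] * feature[i];
        #pragma unroll
        for (int j = 0; j < DEPTH; j++)                 // rotate the buffer
            shift_reg[j] = shift_reg[j + 1];
    }

    float sum = 0.0f;                                   // reduce the partial sums
    #pragma unroll
    for (int i = 0; i < DEPTH; i++)
        sum += shift_reg[i];
    return sum;
}
```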
2) Data Mover Kernels: Two multi-mode 3-D NDRange kernels are designed to fetch/store data from/to the global memory for the computation pipelines. The data and work-item mapping schemes are illustrated in Fig. 3. In convolution mode, the MemRD kernel launches with a global work-item number of ([(W − K)/S + 1] × K, [(W − K)/S + 1] × K, C' × M), while the MemWR kernel works in an NDRange of ((W − K)/S + 1, (W − K)/S + 1, M). Variables W and H represent the width and height of the input feature map, while S denotes the stride of each filtering operation. To enable concurrent work-group processing, the work-items are arranged into multiple concurrent work-groups, each of which has a local work-group size of (K, K, C').

In FC mode, both the input feature and weight data are 1-D vectors as defined in Eq. (2). Directly launching the MemRD kernel with only one classification task would reduce the opportunity for data reuse in the weights. Therefore, we introduce a batched processing capability in MemRD. For instance, a batch of 64 classification tasks can be processed with a single kernel launch by mapping all the input feature maps as a single 3-D data set with an NDRange size of (8, 8, C).
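For concreteness, a hedged host-side sketch of the convolution-mode launch geometry is given below. Only the standard clEnqueueNDRangeKernel call is assumed; the function name, parameter values and the omission of error handling are illustrative simplifications rather than the actual PipeCNN host code.

```c
#include <CL/cl.h>

/* Sketch: launching MemRD in convolution mode with the work-item geometry
 * described in the text (placeholder values; error handling omitted). */
void launch_memrd_conv(cl_command_queue q, cl_kernel memrd,
                       int W, int K, int S, int C_vec, int M)
{
    size_t out_dim = (size_t)((W - K) / S + 1);   /* output feature-map width/height */

    size_t global[3] = { out_dim * K,             /* x: filter columns over all windows */
                         out_dim * K,             /* y: filter rows over all windows    */
                         (size_t)C_vec * M };     /* z: vectorized input ch. x output maps */
    size_t local[3]  = { (size_t)K, (size_t)K, (size_t)C_vec };  /* work-group (K, K, C') */

    clEnqueueNDRangeKernel(q, memrd, 3, NULL, global, local, 0, NULL, NULL);
}
```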
3) Other Kernels: Besides the most compute-intensive convolution kernel, we also designed other OpenCL kernels to accelerate the widely used layer operations in CNNs, such as pooling and LRN. Therefore, PipeCNN can process the complete CNN forward computation flow with very little involvement of the host CPU, resulting in high throughput and low latency.
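As one example of these auxiliary kernels, the sketch below shows a streaming pooling stage that consumes the convolution output channel and forwards the maximum of every POOL_SIZE values. It is an illustrative simplification under the same channels-extension assumption as above, not the configurable Pooling kernel shipped in the PipeCNN repository.

```c
// Hedged sketch of a streaming pooling stage (not the actual PipeCNN kernel):
// take POOL_SIZE consecutive values from the convolution output stream and
// forward their maximum, i.e., 1-D max-pooling over the data stream.
#pragma OPENCL EXTENSION cl_altera_channels : enable

#define POOL_SIZE 2

channel float conv_out_ch __attribute__((depth(64)));
channel float pool_out_ch __attribute__((depth(64)));

__kernel void pooling(int num_outputs) {
    for (int i = 0; i < num_outputs; i++) {
        float m = -INFINITY;
        #pragma unroll
        for (int j = 0; j < POOL_SIZE; j++) {
            float v = read_channel_altera(conv_out_ch);
            m = fmax(m, v);                       // keep the running maximum
        }
        write_channel_altera(pool_out_ch, m);     // pass result to the next stage
    }
}
```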

Fig. 4. Sliding-window-based data buffering scheme: a window covering FT_NUM filter windows (filter size K, stride S) is fetched at once, and the overlapping data are reused across successive sliding-window positions.

Fig. 5. Fixed-point model quantization flow used in this demo: set the target bit-width and initialize Fw, Fin and Fout; for each layer, analyze the data, perform quantization, verify the top-1/top-5 accuracy, and update Fin, Fout and Fw until the goal is met; then save the quantized model.
B. Performance and Bandwidth Optimizations

1) Throughput Optimization: To further improve the throughput of the convolution kernel, data vectorization and parallel CUs are introduced. As shown in Fig. 3, the input features D_i and weights W_l at the same position (x, y) from adjacent feature maps are grouped as one vectorized input. The size of the vectorized data is controlled by the design parameter VEC_SIZE. The vectorized data streams are fetched by the MemRD kernel and sent to multiple CUs in the Conv. kernel through OpenCL Channels, as the colored lines in the figure show. The number of parallel CUs is controlled by another parameter, CU_NUM. By simply changing the values of VEC_SIZE and CU_NUM, the implemented design can achieve scalable performance and hardware cost without modifying the kernel code. In the final design, two 8 × 8 multipliers were also grouped and mapped into one DSP block by manually inserting Altera's IP blocks in the kernel code to improve the efficiency of the pipeline.
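A hedged sketch of how the two parameters could appear in kernel source is shown below: VEC_SIZE sets the width of the multiplier-adder tree that consumes one vectorized input per cycle, and CU_NUM replicates that tree across compute units. The identifiers and structure are illustrative rather than the exact PipeCNN code.

```c
// Illustrative compile-time knobs (values chosen at synthesis time).
#define VEC_SIZE 16   // width of the vectorized input / multiplier-adder tree
#define CU_NUM   28   // number of parallel convolution compute units

typedef struct { float v[VEC_SIZE]; } lane_data;   // one vectorized data item

// One cycle of work for one CU: multiply VEC_SIZE feature/weight pairs and
// reduce them with an adder tree (fully unrolled by the HLS compiler).
float mac_tree(lane_data feature, lane_data weight) {
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < VEC_SIZE; i++)
        sum += feature.v[i] * weight.v[i];
    return sum;
}
// With CU_NUM such trees instantiated in parallel, the accelerator performs
// VEC_SIZE * CU_NUM multiply-accumulates per clock cycle.
```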
2) Bandwidth Optimization: To relieve the pressure on external memory bandwidth, we introduce a sliding-window-based data buffering scheme. As shown in Fig. 4, the filter stride S of the convolution window is usually smaller than the filter size K (in most cases, S = 1). Therefore, a large portion of the data can be reused during the convolution computation. To exploit this data reuse, the MemRD kernel fetches a window of data that covers the area of FT_NUM convolution filters each time and caches the data in on-chip buffers. For successive convolution filtering operations, the feature-map data and weights are then repeatedly loaded from these local memories, avoiding accesses to external memory. To demonstrate the effectiveness of this scheme, we profiled the DDR memory bandwidth of implementations with different values of FT_NUM on different FPGA platforms. The average bandwidth reduction achieved reached up to 50%.
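The role of FT_NUM can be sketched as follows for a single row of input data (an illustrative fragment under assumed parameter values, not the repository code): one burst from external memory fills an on-chip buffer, which then serves FT_NUM successive filtering operations.

```c
// Hedged sketch of sliding-window buffering for a 1-D row of input data.
// One burst from global memory covers FT_NUM successive filter positions
// (window size K, stride S), which are then served from on-chip memory.
#define K       3
#define S       1
#define FT_NUM  4
#define WIN_LEN (K + (FT_NUM - 1) * S)   // samples needed for FT_NUM windows

void load_and_reuse(__global const float *restrict src,   // external DDR
                    __local  float *restrict win_buf,     // on-chip buffer
                    float results[FT_NUM],
                    __constant float *restrict filt)
{
    // Single external-memory burst: WIN_LEN words instead of FT_NUM * K words.
    for (int i = 0; i < WIN_LEN; i++)
        win_buf[i] = src[i];

    // FT_NUM filtering operations reuse the buffered, overlapping data.
    for (int f = 0; f < FT_NUM; f++) {
        float acc = 0.0f;
        #pragma unroll
        for (int k = 0; k < K; k++)
            acc += win_buf[f * S + k] * filt[k];
        results[f] = acc;
    }
}
```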
3) Fixed-point Optimization: Implementing fixed-point arithmetic instead of floating-point computations on FPGAs can significantly reduce hardware costs and memory bandwidth requirements. In this paper, we quantize both the model and the intermediate computation results with a uniform word length and a variable fractional bit-width for each CNN layer. Fig. 5 illustrates how data quantization is performed in this demo. In each layer, the weight, input feature map and output feature map data are represented as fixed-point numbers Q_w · 2^(-F_w), Q_in · 2^(-F_in) and Q_out · 2^(-F_out), respectively. The variable Q denotes a fixed-point binary word of B-bit length, while F denotes the number of fractional bits of the fixed-point number. The quantization flow searches for the group of parameters B_w, F_in and F_out that minimizes the hardware cost while satisfying the accuracy bound at the same time. In this demo, the AlexNet and VGG-16 models are quantized with an 8-bit word length with less than 1% loss in top-1/top-5 accuracy.
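As a concrete illustration of the Q · 2^(-F) representation, the small C helper below quantizes a value to a B-bit word with F fractional bits and converts it back. It is a generic arithmetic sketch, not the quantization tool used for this demo.

```c
#include <math.h>
#include <stdint.h>

/* Quantize x to a signed B-bit fixed-point word Q with F fractional bits,
 * so that the represented value is Q * 2^(-F). */
static int32_t to_fixed(float x, int B, int F) {
    int32_t q_max = (1 << (B - 1)) - 1;          /* e.g. +127 for B = 8 */
    int32_t q_min = -(1 << (B - 1));             /* e.g. -128 for B = 8 */
    long q = lroundf(x * (float)(1 << F));       /* scale and round      */
    if (q > q_max) q = q_max;                    /* saturate             */
    if (q < q_min) q = q_min;
    return (int32_t)q;
}

static float from_fixed(int32_t q, int F) {
    return (float)q / (float)(1 << F);           /* Q * 2^(-F) */
}

/* Example: with B = 8 and F = 6, the weight 0.3125 becomes Q = 20,
 * and from_fixed(20, 6) recovers 0.3125 exactly. */
```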
III. DEMONSTRATION AND EVALUATION

A. Demonstration Setup

In the demonstration, we implemented CNN-based image classification applications on three different FPGA platforms. The detailed information is summarized in Table I. The DE5-net and DE5a-net boards were installed in a desktop computer equipped with an Intel i5-4690K CPU and 64 GB of memory, while the DE1-SoC board was connected to the computer and accessed through a VNC viewer. The OpenCL kernel code was compiled using Altera OpenCL SDK v16.1. The host programs first load images from hard disk or a web camera, and then send them to the FPGA accelerators to perform the CNN forward computations. Two large-scale CNN models, AlexNet (8 layers) and VGG-16 (16 layers), were used as benchmarks to measure the performance. For each board, an external power meter was used to measure the power consumption at runtime. Fig. 7 shows how the DE1-SoC board was set up for the demonstration.

Fig. 7. Demo setup of the DE1-SoC board, with a camera and an external power meter; the results of the image classification application are shown on the screen. The power dissipation was measured at runtime while performing the image classification application.

Fig. 6. Design space exploration for PipeCNN on the DE5-net FPGA platform: panels (a)-(d) show the consumed logic elements, M20K blocks and DSP blocks (against the device limit), and the execution time (ms), versus the number of CUs for VEC_SIZE = 4, 8 and 16. The execution time was measured using the AlexNet model.

TABLE I
SUMMARY OF THE MEASURED PERFORMANCE, COST AND POWER CONSUMPTION ON DIFFERENT PLATFORMS.

Platform | FPGA Type       | Resource Capacity      | Resource Consumed  | Execution Time (AlexNet) | Execution Time (VGG-16) | Frequency | Board Power
DE1-SoC  | Cyclone-V SEA5  | 85K LEs, 87 DSPs       | 45K LEs, 68 DSPs   | 140 ms                   | 1,928 ms                | 122 MHz   | 2.1 W
DE5-net  | Stratix-V GXA7  | 622K LEs, 256 DSPs     | 112K LEs, 247 DSPs | 15 ms                    | 254 ms                  | 198 MHz   | 27 W
DE5a-net | Arria-10 GX1150 | 1,150K LEs, 1,518 DSPs | 322K LEs, 683 DSPs | 5 ms                     | 110 ms                  | 218 MHz   | 26 W

B. Design Space Exploration

As discussed in Section II-B, two design parameters, VEC_SIZE and CU_NUM, are used to control the throughput and hardware cost of the FPGA accelerator. Therefore, design space exploration can be performed quantitatively by implementing the accelerator with different parameter configurations. Fig. 6 illustrates the exploration results on the DE5-net platform. It can be observed from Fig. 6(b) and (d) that the accelerator with parameters VEC_SIZE=16 and CU_NUM=28 maximizes the DSP utilization and uses the shortest time for image classification.

C. Performance Evaluation

For each platform, we performed the design space exploration and found the implementation that achieved the best performance. The results are summarized in Table I. For instance, the best performance that PipeCNN achieves on the DE5-net platform is 67 fps (15 ms/img) for the AlexNet model. To demonstrate how fast PipeCNN can accelerate CNN computations, we also performed image classification on the CPU by using the Caffe tool installed on our desktop computer. The execution times for AlexNet and VGG-16 are 189 ms and 1,547 ms, respectively. We can see that using the FPGA-based accelerator achieves up to a 37× performance improvement (189 ms on the CPU versus 5 ms on the DE5a-net for AlexNet) for CNN-based image classification applications. The work of [3] also presents an OpenCL-based CNN accelerator on an Arria-10 FPGA. That design adopted Winograd transformations to reduce the number of computations required by the convolution layers, and thus achieved much higher performance (i.e., 1020 fps for AlexNet) than ours. In future work, we will explore sparse convolution algorithms to further improve the performance of PipeCNN.

IV. CONCLUSION

This paper demonstrated an open-source OpenCL-based FPGA accelerator for convolutional neural networks. An efficient hardware architecture with pipelined kernels was presented. Throughput and memory bandwidth optimization schemes were also discussed. The implemented design shows scalable performance and cost on multiple FPGA platforms.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Information Processing Systems (NIPS'12), 2012.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[3] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL deep learning accelerator on Arria 10," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17), 2017.
[4] J. Qiu, J. Wang, S. Yao, et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), 2016.
[5] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, "DLAU: a scalable deep learning accelerator unit on FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016.
[6] https://github.com/doonny/PipeCNN

