efficiency over GPUs, especially now that OpenCL-based high-level synthesis tools are available and provide fast verification and implementation flows. In this paper, we demonstrate PipeCNN – an efficient FPGA accelerator that can be implemented on a variety of FPGA platforms with reconfigurable performance and cost. The PipeCNN project is openly accessible, and thus can be used either by researchers as a generic framework to explore new hardware architectures or by teachers as an off-the-shelf design example for academic courses related to FPGAs.
Fig. 2. The hardware architecture of the convolution kernel: vectorized weights and features feed CU_NUM parallel compute units (CUs), each built around a VEC_SIZE-wide pipelined multiplier-adder tree with a delayed buffer and an output buffer.

Fig. 3. Data and work-item mapping scheme of the data mover kernels: the MemRD and MemWR NDRange kernels exchange data with global memory and feed the single-threaded convolution kernel through channels/pipes; each dimension of the filtered output covers (W − K)/S + 1 positions, and a local work-group spans one K × K filter window.
and MemWR, transfer feature map and weight data from/to the global memory (i.e., the external DDR memory), feeding the other kernels with high-throughput data streams. The Convolution kernel (Conv.) is designed to accelerate the most compute-intensive computations in CNNs, i.e., the convolution layer and the FC layer. The Pooling kernel performs subsampling operations directly on the output data stream of the Conv. kernel. The Local Response Normalization (LRN) kernel fetches data from global memory and performs normalization on the feature maps of neighboring neurons [1]. This architecture has the following advantages: 1) the cascaded kernels form a deep pipeline, which can execute a series of basic CNN operations without the need of storing interlayer data back to external memory, significantly relieving the demand on memory bandwidth, which is essential for embedded FPGAs; 2) we use a single hardware kernel to implement both the convolution and FC layers, which further improves the efficiency of hardware resource utilization. Detailed kernel designs and the corresponding optimization schemes are as follows:

1) Convolution Kernel: The convolution operation is essentially a 3-dimensional (3-D) multiply-accumulate (MAC) operation that can be defined by

D_o(f_o, y, x) = \sum_{f_i=1}^{C_l} \sum_{k_y=0}^{K-1} \sum_{k_x=0}^{K-1} W_l(f_o, f_i, k_y, k_x) \cdot D_i(f_i, y + k_y, x + k_x)    (1)

where D_i(f_i, y, x) and D_o(f_o, y, x) denote the neurons at position (x, y) in the input feature map f_i and the output feature map f_o, respectively. W_l(f_o, f_i, k_y, k_x) represents the corresponding weights in the l-th layer that get convolved with f_i. The size of the convolution filters is K × K, while the total number of input feature maps is C_l. In this paper, we propose to implement (1) by using an HLS-friendly 1-D convolution structure which flattens the 3-D convolution as follows:

D_o(f_o) = \sum_{f_i=1}^{C_l \times K \times K} W_l(f_o, f_i) \cdot D_i(f_i)    (2)

In this way, nested loops can be avoided in the kernel code, and an efficient convolution pipeline structure consisting of a multiplier-adder tree with a delayed buffer is generated by the compiler, as Fig. 2 shows. When an appropriate buffer depth is selected, the proposed structure can be efficiently pipelined by the OpenCL compiler with an initiation interval of only one clock cycle. Each convolution pipeline constitutes a compute unit (CU), and the kernel consists of multiple CUs to perform parallel convolutions.
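As a concrete illustration of this structure, the following minimal OpenCL kernel sketch implements the flattened 1-D convolution of Eq. (2) with a VEC_SIZE-wide multiplier-adder tree and a delayed (shift-register) accumulation buffer. It assumes the legacy Intel/Altera channel extension; the identifiers data_ch, weight_ch, result_ch, flat_len and DELAY_DEPTH are illustrative choices and not the actual PipeCNN source.

#pragma OPENCL EXTENSION cl_altera_channels : enable

#define VEC_SIZE    8      /* vectorization width (design parameter)        */
#define DELAY_DEPTH 16     /* depth of the delayed accumulation buffer      */

typedef struct { short data[VEC_SIZE]; } vec_t;

channel vec_t data_ch   __attribute__((depth(64)));
channel vec_t weight_ch __attribute__((depth(64)));
channel int   result_ch __attribute__((depth(64)));

__kernel void conv_1d(const int flat_len)   /* flat_len = (Cl*K*K)/VEC_SIZE */
{
    int shift_reg[DELAY_DEPTH + 1];
    #pragma unroll
    for (int j = 0; j <= DELAY_DEPTH; j++)
        shift_reg[j] = 0;

    for (int i = 0; i < flat_len; i++) {
        vec_t d = read_channel_altera(data_ch);
        vec_t w = read_channel_altera(weight_ch);
        int partial = 0;
        #pragma unroll
        for (int v = 0; v < VEC_SIZE; v++)      /* multiplier-adder tree    */
            partial += d.data[v] * w.data[v];
        /* delayed-buffer accumulation: each new partial sum is added to a
           value produced DELAY_DEPTH iterations earlier                     */
        shift_reg[DELAY_DEPTH] = shift_reg[0] + partial;
        #pragma unroll
        for (int j = 0; j < DELAY_DEPTH; j++)
            shift_reg[j] = shift_reg[j + 1];
    }

    int sum = 0;
    #pragma unroll
    for (int j = 0; j < DELAY_DEPTH; j++)       /* drain the delayed buffer */
        sum += shift_reg[j];
    write_channel_altera(result_ch, sum);
}

Because each partial sum is accumulated against a value produced DELAY_DEPTH iterations earlier, no tight loop-carried dependency remains, which is what lets the offline compiler schedule the loop at an initiation interval of one.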
2) Data Mover Kernels: Two multi-mode 3-D NDRange kernels are designed to fetch/store data from/to the global memory for the computation pipelines. The data and work-item mapping schemes are illustrated in Fig. 3. In convolution mode, the MemRD kernel is launched with a global work-item size of ([(W − K)/S + 1] × K, [(W − K)/S + 1] × K, C × M), while the MemWR kernel works in an NDRange of ((W − K)/S + 1, (W − K)/S + 1, M). Variables W and H represent the width and height of the input feature map, while S denotes the stride of each filtering operation. To enable concurrent work-group processing, the work-items are arranged into multiple concurrent work-groups, each of which has a local work-group size of (K, K, C').
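On the host side, these NDRange sizes could be computed and enqueued roughly as follows. The function name, queue handles, and the reading of C and M as the numbers of input and output feature maps (with C_local standing for C') are assumptions made for illustration, and error checking is omitted.

#include <CL/cl.h>

void launch_data_movers(cl_command_queue q_rd, cl_command_queue q_wr,
                        cl_kernel memrd, cl_kernel memwr,
                        size_t W, size_t K, size_t S,
                        size_t C, size_t C_local, size_t M)
{
    size_t out_dim = (W - K) / S + 1;          /* output feature-map width/height */

    /* MemRD: one work-item per position inside every filter window,
       for every (input, output) feature-map pair                               */
    size_t rd_global[3] = { out_dim * K, out_dim * K, C * M };
    size_t rd_local[3]  = { K, K, C_local };   /* local work-group size (K, K, C') */

    /* MemWR: one work-item per output pixel of every output feature map        */
    size_t wr_global[3] = { out_dim, out_dim, M };

    clEnqueueNDRangeKernel(q_rd, memrd, 3, NULL, rd_global, rd_local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q_wr, memwr, 3, NULL, wr_global, NULL, 0, NULL, NULL);
}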
In FC mode, both the input feature and weight data are 1-D vectors as defined in Eq. (2). Directly launching the MemRD kernel with only one classification task would reduce the opportunity for data reuse in the weights. Therefore, we introduce a batched processing capability in MemRD. For instance, a batch of 64 classification tasks can be processed with a single kernel launch by mapping all the input feature maps as a single 3-D data set with the NDRange size of (8, 8, C).

3) Other Kernels: Besides the most compute-intensive convolution kernel, we also designed other OpenCL kernels to accelerate the widely used layer operations in CNNs, such as pooling and LRN. Therefore, PipeCNN can process the complete CNN forward computation flow with very little involvement of the host CPU, resulting in high throughput and low latency.

B. Performance and Bandwidth Optimizations

1) Throughput Optimization: To further improve the throughput of the convolution kernel, data vectorization and parallel CUs are introduced. As shown in Fig. 3, the input features Di and weights Wl at the same position (x, y) from adjacent feature maps are grouped as one vectorized input. The size of the vectorized data is controlled by the design parameter VEC_SIZE. The vectorized data streams are fetched by the MemRD kernel and sent to multiple CUs in the Conv. kernel through OpenCL channels, as the colored lines show. The number of parallel CUs used is controlled by another parameter, CU_NUM. By simply changing the values of the parameters VEC_SIZE and CU_NUM, the implemented design can achieve scalable performance and hardware cost without the need of modifying the kernel code. In the final design, two 8 × 8 multipliers were also grouped and mapped into one DSP block by manually inserting Altera's IP blocks into the kernel code to improve the efficiency of the pipeline.
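A hedged sketch of how these two parameters might appear in the kernel code is given below: VEC_SIZE sizes the vectorized operand type, while CU_NUM sizes a channel array and a fully unrolled loop that replicates the MAC datapath, so rebuilding with different macro values changes the generated hardware without touching the source. The channel and kernel names are illustrative, and the accumulation is simplified (the delayed buffer from the earlier sketch is omitted).

#define VEC_SIZE 8                       /* width of each vectorized operand  */
#define CU_NUM   4                       /* number of parallel compute units  */

typedef struct { short data[VEC_SIZE]; } vec_t;

/* one pair of streams per compute unit */
channel vec_t cu_data_ch[CU_NUM]   __attribute__((depth(64)));
channel vec_t cu_weight_ch[CU_NUM] __attribute__((depth(64)));
channel int   cu_result_ch[CU_NUM] __attribute__((depth(64)));

__kernel void conv_parallel(const int flat_len)
{
    int acc[CU_NUM];
    #pragma unroll
    for (int cu = 0; cu < CU_NUM; cu++) acc[cu] = 0;

    for (int i = 0; i < flat_len; i++) {
        #pragma unroll                       /* replicate the datapath CU_NUM times */
        for (int cu = 0; cu < CU_NUM; cu++) {
            vec_t d = read_channel_altera(cu_data_ch[cu]);
            vec_t w = read_channel_altera(cu_weight_ch[cu]);
            #pragma unroll                   /* VEC_SIZE-wide multiplier-adder tree */
            for (int v = 0; v < VEC_SIZE; v++)
                acc[cu] += d.data[v] * w.data[v];
        }
    }
    #pragma unroll
    for (int cu = 0; cu < CU_NUM; cu++)
        write_channel_altera(cu_result_ch[cu], acc[cu]);
}

Rebuilding the same source with different macro values then yields a different hardware instance, which is the mechanism exploited later in the design space exploration.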
Fig. 4. Sliding-window-based data buffering scheme: of the FT_NUM filter windows (size K, stride S), adjacent sliding-window positions share data that can be reused from the on-chip buffer.
Fig. 5. Fixed-point model quantization flow used in this demo: the network model is loaded and the fractional bit-widths Fw, Fin and Fout are initialized for a target bit-width; then, layer by layer, the data are analyzed, the weights are quantized, the top-1/top-5 accuracy is verified, and Fw, Fin and Fout are updated until the accuracy goal is met, after which the quantized model is saved.

Fig. 7. Demonstration setup: a power meter and a screen on which the results of the image classification application are shown.

2) Bandwidth Optimization: To relieve the pressure on external memory bandwidth, we introduce a sliding-window-based data buffering scheme, as illustrated in Fig. 4.
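A generic sketch of such a sliding-window buffer, in the spirit of Fig. 4 but not taken from the PipeCNN implementation, is shown below: when the filter window advances by the stride S, only the S newly exposed columns are fetched from global memory, while the K − S overlapping columns are reused from the on-chip buffer. All identifiers and the data layout are illustrative assumptions.

#define K 3                      /* filter window size            */
#define S 1                      /* filter stride                 */
#define C 64                     /* feature-map depth (channels)  */

/* On-chip window buffer holding the K x K x C data of the current
   sliding-window position. Only the S newly exposed columns are read
   from global memory when the window advances by S.                  */
void advance_window(__global const short *restrict feature,
                    short win_buf[K][K][C],
                    int row, int col, int width)
{
    /* shift the reused columns left by S positions */
    for (int x = 0; x < K - S; x++)
        for (int y = 0; y < K; y++)
            for (int c = 0; c < C; c++)
                win_buf[x][y][c] = win_buf[x + S][y][c];

    /* fetch only the S new columns from global memory */
    for (int x = K - S; x < K; x++)
        for (int y = 0; y < K; y++)
            for (int c = 0; c < C; c++)
                win_buf[x][y][c] = feature[((row + y) * width + (col + x)) * C + c];
}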
Fig. 6. Design space exploration for PipeCNN on the DE5-net FPGA platform: resource usage (including M20K and DSP blocks, with the device limit marked) and the classification time are plotted against the number of CUs for VEC_SIZE = 4, 8 and 16. The execution time was measured by using the AlexNet model.
TABLE I
SUMMARY OF THE MEASURED PERFORMANCE, COST AND POWER CONSUMPTION ON DIFFERENT PLATFORMS.
consumption at runtime. Fig. 7 shows how the DE1-SoC board was set up for the demonstration.

B. Design Space Exploration

As discussed in Section II-B, two design parameters, VEC_SIZE and CU_NUM, are used to control the throughput and hardware cost of the FPGA accelerator. Therefore, design space exploration can be performed quantitatively by implementing the accelerator with different parameter configurations. Fig. 6 illustrates the exploration results on the DE5-net platform. It can be observed from Fig. 6(b) and (d) that the accelerator with the parameters VEC_SIZE=16 and CU_NUM=28 maximizes the DSP utilization and achieves the shortest image classification time.
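For illustration, one design point of this exploration can be captured by a pair of compile-time macros along the following lines; the header name is hypothetical, and every (VEC_SIZE, CU_NUM) pair requires a full recompilation of the FPGA kernel binary.

/* hw_config.h -- one design point of the exploration (illustrative file name) */
#define VEC_SIZE 16   /* width of the vectorized data path             */
#define CU_NUM   28   /* number of parallel convolution compute units  */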
CNN accelerator on an Arria-10 FPGA. That design adopted Winograd transformations to reduce the number of computations required by the convolution layers, and thus achieved much higher performance (i.e., 1020 fps for AlexNet) than ours. In future work, we will explore sparse convolution algorithms to further improve the performance of PipeCNN.

IV. CONCLUSION

This paper demonstrated an open-source OpenCL-based FPGA accelerator for convolutional neural networks. An efficient hardware architecture with pipelined kernels was presented. Throughput and memory bandwidth optimization schemes were also discussed. The implemented design shows scalable performance and cost on multiple FPGA platforms.