efficiency over GPUs, especially now that OpenCL-based high-level synthesis tools are available and provide fast verification and implementation flows. In this paper, we demonstrate PipeCNN – an efficient FPGA accelerator that can be implemented on a variety of FPGA platforms with reconfigurable performance and cost. The PipeCNN project is openly accessible, and thus can be used either by researchers as a generic framework to explore new hardware architectures or by teachers as an off-the-shelf design example for academic courses related to FPGAs.
Fig. 2. The hardware architecture of the convolution kernel: vectorized weights and features feed CU_NUM parallel compute units (CUs), each built around a VEC_SIZE-wide pipelined multiplier-adder tree with a delayed buffer and an output buffer.

Fig. 3. Data and work-item mapping scheme of the data mover kernels: the MemRD and MemWR NDRange kernels exchange data with global memory and feed the single-threaded convolution kernel through channels/pipes; each dimension of the filtered output covers (W − K)/S + 1 positions, and a local work-group spans one K × K filter window.
and MemWR, transfer feature map and weight data from/to the global memory (i.e., the external DDR memory), feeding the other kernels with high-throughput data streams. The Convolution kernel (Conv.) is designed to accelerate the most compute-intensive computations in CNNs, i.e., the convolution layer and the FC layer. The Pooling kernel performs subsampling operations directly on the output data stream of the Conv. kernel. The Local Response Normalization (LRN) kernel fetches data from global memory and performs normalization on the feature maps of neighboring neurons [1]. This architecture has the following advantages: 1) the cascaded kernels form a deep pipeline, which can execute a series of basic CNN operations without the need of storing interlayer data back to external memory, significantly relieving the demand on memory bandwidth, which is essential for embedded FPGAs; 2) we use a single hardware kernel to implement both the convolution and FC layers, which further improves the efficiency of hardware resource utilization. Detailed kernel designs and the corresponding optimization schemes are as follows:

1) Convolution Kernel: The convolution operation is essentially a 3-dimensional (3-D) multiply-accumulate (MAC) operation that can be defined by

D_o(f_o, y, x) = \sum_{f_i=1}^{C_l} \sum_{k_y=0}^{K-1} \sum_{k_x=0}^{K-1} W_l(f_o, f_i, k_y, k_x) \cdot D_i(f_i, y + k_y, x + k_x)    (1)

where D_i(f_i, y, x) and D_o(f_o, y, x) denote the neurons at position (x, y) in the input feature map f_i and the output feature map f_o, respectively. W_l(f_o, f_i, k_y, k_x) represents the corresponding weights in the l-th layer that get convolved with f_i. The size of the convolution filters is K × K, while the total number of input feature maps is C_l. In this paper, we propose to implement (1) by using an HLS-friendly 1-D convolution structure which flattens the 3-D convolution as follows:

D_o(f_o) = \sum_{f_i=1}^{C_l \times K \times K} W_l(f_o, f_i) \cdot D_i(f_i)    (2)

In this way, nested loops can be avoided in the kernel code, and an efficient convolution pipeline structure consisting of a multiplier-adder tree with a delayed buffer is generated by the compiler, as Fig. 2 shows. When an appropriate buffer depth is selected, the proposed structure can be efficiently pipelined by the OpenCL compiler with an initiation interval of only one clock cycle. Each convolution pipeline constitutes a compute unit (CU), and the kernel consists of multiple CUs to perform parallel convolutions.
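As a concrete illustration of this structure, the following minimal OpenCL kernel sketch implements the flattened 1-D convolution of Eq. (2) with a VEC_SIZE-wide multiplier-adder tree and a delayed (shift-register) accumulation buffer. It assumes the legacy Intel/Altera channel extension; the identifiers data_ch, weight_ch, result_ch, flat_len and DELAY_DEPTH are illustrative choices and not the actual PipeCNN source.

#pragma OPENCL EXTENSION cl_altera_channels : enable

#define VEC_SIZE    8      /* vectorization width (design parameter)        */
#define DELAY_DEPTH 16     /* depth of the delayed accumulation buffer      */

typedef struct { short data[VEC_SIZE]; } vec_t;

channel vec_t data_ch   __attribute__((depth(64)));
channel vec_t weight_ch __attribute__((depth(64)));
channel int   result_ch __attribute__((depth(64)));

__kernel void conv_1d(const int flat_len)   /* flat_len = (Cl*K*K)/VEC_SIZE */
{
    int shift_reg[DELAY_DEPTH + 1];
    #pragma unroll
    for (int j = 0; j <= DELAY_DEPTH; j++)
        shift_reg[j] = 0;

    for (int i = 0; i < flat_len; i++) {
        vec_t d = read_channel_altera(data_ch);
        vec_t w = read_channel_altera(weight_ch);
        int partial = 0;
        #pragma unroll
        for (int v = 0; v < VEC_SIZE; v++)      /* multiplier-adder tree    */
            partial += d.data[v] * w.data[v];
        /* delayed-buffer accumulation: each new partial sum is added to a
           value produced DELAY_DEPTH iterations earlier                     */
        shift_reg[DELAY_DEPTH] = shift_reg[0] + partial;
        #pragma unroll
        for (int j = 0; j < DELAY_DEPTH; j++)
            shift_reg[j] = shift_reg[j + 1];
    }

    int sum = 0;
    #pragma unroll
    for (int j = 0; j < DELAY_DEPTH; j++)       /* drain the delayed buffer */
        sum += shift_reg[j];
    write_channel_altera(result_ch, sum);
}

Because each partial sum is accumulated against a value produced DELAY_DEPTH iterations earlier, no tight loop-carried dependency remains, which is what lets the offline compiler schedule the loop at an initiation interval of one.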
2) Data Mover Kernels: Two multi-mode 3-D NDRange kernels are designed to fetch/store data from/to the global memory for the computation pipelines. The data and work-item mapping schemes are illustrated in Fig. 3. In convolution mode, the MemRD kernel is launched with a global work-item size of ([(W − K)/S + 1] × K, [(W − K)/S + 1] × K, C × M), while the MemWR kernel works in an NDRange of ((W − K)/S + 1, (W − K)/S + 1, M). Variables W and H represent the width and height of the input feature map, while S denotes the stride of each filtering operation. To enable concurrent work-group processing, the work-items are arranged into multiple concurrent work-groups, each of which has a local work-group size of (K, K, C').
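On the host side, these NDRange sizes could be computed and enqueued roughly as follows. The function name, queue handles, and the reading of C and M as the numbers of input and output feature maps (with C_local standing for C') are assumptions made for illustration, and error checking is omitted.

#include <CL/cl.h>

void launch_data_movers(cl_command_queue q_rd, cl_command_queue q_wr,
                        cl_kernel memrd, cl_kernel memwr,
                        size_t W, size_t K, size_t S,
                        size_t C, size_t C_local, size_t M)
{
    size_t out_dim = (W - K) / S + 1;          /* output feature-map width/height */

    /* MemRD: one work-item per position inside every filter window,
       for every (input, output) feature-map pair                               */
    size_t rd_global[3] = { out_dim * K, out_dim * K, C * M };
    size_t rd_local[3]  = { K, K, C_local };   /* local work-group size (K, K, C') */

    /* MemWR: one work-item per output pixel of every output feature map        */
    size_t wr_global[3] = { out_dim, out_dim, M };

    clEnqueueNDRangeKernel(q_rd, memrd, 3, NULL, rd_global, rd_local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q_wr, memwr, 3, NULL, wr_global, NULL, 0, NULL, NULL);
}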
In FC mode, both the input feature and weight data are 1-D vectors as defined in Eq. (2). Directly launching the MemRD kernel with only one classification task would reduce the opportunity for data reuse in the weights. Therefore, we introduce a batched processing capability in MemRD. For instance, a batch of 64 classification tasks can be processed with a single kernel launch by mapping all the input feature maps as a single 3-D data set with the NDRange size of (8, 8, C).

3) Other Kernels: Besides the most compute-intensive convolution kernel, we also designed other OpenCL kernels to accelerate the widely used layer operations in CNNs, such as pooling and LRN. Therefore, PipeCNN can process the complete CNN forward computation flow with very little involvement of the host CPU, resulting in high throughput and low latency.

B. Performance and Bandwidth Optimizations

1) Throughput Optimization: To further improve the throughput of the convolution kernel, data vectorization and parallel CUs are introduced. As shown in Fig. 3, the input features Di and weights Wl at the same position (x, y) from adjacent feature maps are grouped as one vectorized input. The size of the vectorized data is controlled by the design parameter VEC_SIZE. The vectorized data streams are fetched by the MemRD kernel and sent to multiple CUs in the Conv. kernel through OpenCL channels, as the colored lines show. The number of parallel CUs used is controlled by another parameter, CU_NUM. By simply changing the values of the parameters VEC_SIZE and CU_NUM, the implemented design can achieve scalable performance and hardware cost without the need of modifying the kernel code. In the final design, two 8 × 8 multipliers were also grouped and mapped into one DSP block by manually inserting Altera's IP blocks into the kernel code to improve the efficiency of the pipeline.
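A hedged sketch of how these two parameters might appear in the kernel code is given below: VEC_SIZE sizes the vectorized operand type, while CU_NUM sizes a channel array and a fully unrolled loop that replicates the MAC datapath, so rebuilding with different macro values changes the generated hardware without touching the source. The channel and kernel names are illustrative, and the accumulation is simplified (the delayed buffer from the earlier sketch is omitted).

#define VEC_SIZE 8                       /* width of each vectorized operand  */
#define CU_NUM   4                       /* number of parallel compute units  */

typedef struct { short data[VEC_SIZE]; } vec_t;

/* one pair of streams per compute unit */
channel vec_t cu_data_ch[CU_NUM]   __attribute__((depth(64)));
channel vec_t cu_weight_ch[CU_NUM] __attribute__((depth(64)));
channel int   cu_result_ch[CU_NUM] __attribute__((depth(64)));

__kernel void conv_parallel(const int flat_len)
{
    int acc[CU_NUM];
    #pragma unroll
    for (int cu = 0; cu < CU_NUM; cu++) acc[cu] = 0;

    for (int i = 0; i < flat_len; i++) {
        #pragma unroll                       /* replicate the datapath CU_NUM times */
        for (int cu = 0; cu < CU_NUM; cu++) {
            vec_t d = read_channel_altera(cu_data_ch[cu]);
            vec_t w = read_channel_altera(cu_weight_ch[cu]);
            #pragma unroll                   /* VEC_SIZE-wide multiplier-adder tree */
            for (int v = 0; v < VEC_SIZE; v++)
                acc[cu] += d.data[v] * w.data[v];
        }
    }
    #pragma unroll
    for (int cu = 0; cu < CU_NUM; cu++)
        write_channel_altera(cu_result_ch[cu], acc[cu]);
}

Rebuilding the same source with different macro values then yields a different hardware instance, which is the mechanism exploited later in the design space exploration.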
Fig. 4. Sliding-window-based data buffering scheme: of the FT_NUM filter windows (size K, stride S), adjacent sliding-window positions share data that can be reused from the on-chip buffer.
Fig. 5. Fixed-point model quantization flow used in this demo: the network model is loaded and the fractional bit-widths Fw, Fin and Fout are initialized for a target bit-width; then, layer by layer, the data are analyzed, the weights are quantized, the top-1/top-5 accuracy is verified, and Fw, Fin and Fout are updated until the accuracy goal is met, after which the quantized model is saved.

Fig. 7. Demonstration setup: a power meter and a screen on which the results of the image classification application are shown.

2) Bandwidth Optimization: To relieve the pressure on external memory bandwidth, we introduce a sliding-window-based data buffering scheme, as illustrated in Fig. 4.
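A generic sketch of such a sliding-window buffer, in the spirit of Fig. 4 but not taken from the PipeCNN implementation, is shown below: when the filter window advances by the stride S, only the S newly exposed columns are fetched from global memory, while the K − S overlapping columns are reused from the on-chip buffer. All identifiers and the data layout are illustrative assumptions.

#define K 3                      /* filter window size            */
#define S 1                      /* filter stride                 */
#define C 64                     /* feature-map depth (channels)  */

/* On-chip window buffer holding the K x K x C data of the current
   sliding-window position. Only the S newly exposed columns are read
   from global memory when the window advances by S.                  */
void advance_window(__global const short *restrict feature,
                    short win_buf[K][K][C],
                    int row, int col, int width)
{
    /* shift the reused columns left by S positions */
    for (int x = 0; x < K - S; x++)
        for (int y = 0; y < K; y++)
            for (int c = 0; c < C; c++)
                win_buf[x][y][c] = win_buf[x + S][y][c];

    /* fetch only the S new columns from global memory */
    for (int x = K - S; x < K; x++)
        for (int y = 0; y < K; y++)
            for (int c = 0; c < C; c++)
                win_buf[x][y][c] = feature[((row + y) * width + (col + x)) * C + c];
}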
Fig. 6. Design space exploration for PipeCNN on the DE5-net FPGA platform: resource usage (including M20K and DSP blocks, with the device limit marked) and the classification time are plotted against the number of CUs for VEC_SIZE = 4, 8 and 16. The execution time was measured by using the AlexNet model.
TABLE I
SUMMARY OF THE MEASURED PERFORMANCE, COST AND POWER CONSUMPTION ON DIFFERENT PLATFORMS.
consumption at runtime. Fig. 7 shows how the DE1-SoC board was set up for the demonstration.

B. Design Space Exploration

As discussed in Section II-B, two design parameters, VEC_SIZE and CU_NUM, are used to control the throughput and hardware cost of the FPGA accelerator. Therefore, design space exploration can be performed quantitatively by implementing the accelerator with different parameter configurations. Fig. 6 illustrates the exploration results on the DE5-net platform. It can be observed from Fig. 6(b) and (d) that the accelerator with the parameters VEC_SIZE=16 and CU_NUM=28 maximizes the DSP utilization and achieves the shortest image classification time.
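For illustration, one design point of this exploration can be captured by a pair of compile-time macros along the following lines; the header name is hypothetical, and every (VEC_SIZE, CU_NUM) pair requires a full recompilation of the FPGA kernel binary.

/* hw_config.h -- one design point of the exploration (illustrative file name) */
#define VEC_SIZE 16   /* width of the vectorized data path             */
#define CU_NUM   28   /* number of parallel convolution compute units  */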
CNN accelerator on an Arria-10 FPGA. That design adopted Winograd transformations to reduce the number of computations required by the convolution layers, and thus achieved much higher performance (i.e., 1020 fps for AlexNet) than ours. In future work, we will explore sparse convolution algorithms to further improve the performance of PipeCNN.

IV. CONCLUSION

This paper demonstrated an open-source OpenCL-based FPGA accelerator for convolutional neural networks. An efficient hardware architecture with pipelined kernels was presented. Throughput and memory bandwidth optimization schemes were also discussed. The implemented design shows scalable performance and cost on multiple FPGA platforms.