
2019 3rd International Conference on Circuits, System and Simulation

Accelerator Implementation of Lenet-5 Convolution Neural Network Based on FPGA with HLS

Dai Rongshi and Tang Yongming
School of Electronic Science and Engineering, Southeast University, Nanjing, China
e-mail: [email protected]

Abstract—Convolution neural network is widely used in image recognition because it can imitate the behavioral characteristics of the biological visual nerve and has high recognition accuracy. It is a kind of feedforward neural network which contains convolution computation and has a deep structure, and it is one of the representative algorithms of deep learning. Because the convolution neural network has a special calculation mode, general processors implement it inefficiently and cannot meet the performance requirements. To solve this problem, we implement the convolution neural network on FPGA and optimize the convolution operation to improve computing parallelism, data throughput and energy efficiency compared with traditional processors. Finally, we implemented the convolution neural network of the Lenet-5 model on the ZYBO Z7 FPGA board and compared it with a traditional processor. We realized fast recognition of a picture at a frequency of 100 MHz with DMA control. The data throughput of the FPGA is more than four times higher than that of the general processor, and the power consumption is 1.8 W, which is much lower than that of the general processor.

Keywords-FPGA; Lenet-5; HLS; convolution neural network; optimization

I. INTRODUCTION

In recent years, the convolution neural network has been widely used as a kind of neural network and has shown great effect in different fields. As a well-known deep learning framework, convolution neural networks have many applications in various fields, including machine vision, image search, image classification and so on [10]. Inspired by the biological visual nerve, the convolution neural network uses convolution kernels in a sliding operation, extracts features from the image, and transforms the features into final results by mapping between layers, which gives relatively high accuracy in image recognition.

In most front-end platform implementations, the convolution neural network depends on a CPU or GPU to complete the calculation. But for some tasks the front-end platform must be small and low-power, and the general processor is not efficient enough, so a new front-end platform is needed to complete the task [8].

Given these requirements, FPGA and ASIC chips are the new development direction [3][9]. FPGA and ASIC chips are used as front-end accelerators to improve the performance of the convolution neural network, and they can balance data throughput and power consumption properly. However, due to the high cost and long R&D cycle of an ASIC chip, it is difficult to use it on a large scale. Therefore, FPGA is a suitable platform in the short term. The convolution neural network accelerator based on FPGA has attracted more and more researchers' attention, because it offers good performance, high efficiency, a fast development cycle and strong reconfigurability [1][2][3].

Different model algorithms call for different optimization schemes. While making reasonable use of the resources of the FPGA, the efficiency should be optimized as far as possible. Different design schemes may lead to huge differences in performance; in particular, reading the same data many times and storing it in multiple places wastes resources and time. This paper completes the acceleration of a convolution neural network on FPGA. Firstly, we propose an operation mode to realize data flow on the FPGA. Secondly, we reduce the repeated reading of data through a register buffer in the convolution process, and put forward an operation mode to realize the data flow during convolution. Finally, in order to avoid the time delay caused by the processor and the low bandwidth of the interface, direct memory access is adopted for data stream acquisition.

II. BACKGROUND

A. CNN Basics

Research on convolution neural networks began in the 1980s and 1990s. The time-delay network and Lenet-5 are the earliest convolution neural networks. After the turn of the century, the convolution neural network developed rapidly with the development of deep learning theory and the improvement of numerical computing equipment, and it has been widely used in computer vision, natural language processing and other fields [9]. The convolution neural network is built to mimic the visual perception mechanism of biological organisms. The convolution kernel parameters in the hidden layers and the sparse connections between the network layers enable the convolution neural network to lattice the layer data with less computation in order to facilitate feature extraction. When convolution neural networks are used as supervised learning algorithms, the feedforward part is often used for image recognition and classification, and the feedback part for network training.

978-1-7281-3657-8/19/$31.00 ©2019 IEEE


Most users use trained weight data to run the convolution neural network and complete some real-time tasks, so the speed of the feedforward calculation is more important. In this paper, we focus on accelerating the feedforward part of convolution neural networks on FPGA.

The traditional neural network architecture consists of two parts: a feature extractor and a classifier. The function of the feature extractor is to extract the features of the input image. By mapping one feature map to another feature map, the features of these images are not unique; they can contain most of the shapes, mainly through the sliding of the convolution kernel. The weights of the convolution kernel should include a one-to-one mapping between the input layer and the output layer. The classifier is the process of transforming the feature layer into the output structure.

As shown in Figure 1, in the case of Lenet-5, the feature extractor contains convolution and downsampling layers. The full connection layer is the classifier.

Figure 1. Lenet-5 structure

The Lenet-5 network model has three convolution layers, two downsampling layers and two full connection layers.

TABLE I. LENET-5 NETWORK LAYER PARAMETERS

Layer    Conv1      Conv2       Conv3         Fullconnect1  Fullconnect2
Weights  <1,5,5,6>  <6,5,5,16>  <16,5,5,120>  <120,84>      <84,10>
Bias     6          16          120           84            10

B. FPGA Speedup

The FPGA clock is usually only a few hundred MHz, while a general purpose processor's main frequency is as high as a few GHz, yet the FPGA often runs specific workloads faster than the general purpose processor. A general purpose processor may need many clock cycles to perform specific operations (such as signal processing or image processing), but an FPGA can directly generate a dedicated circuit by programming its reconfigurable fabric, optimizing parallel operation, pipelining and memory access, which improves the speed of reading and writing and greatly improves the speed of these specific operations.

In this paper, several optimization methods are used to improve the speed on the FPGA [12].

Array Partitioning can change the order of the array in memory, as well as the number of ports, to increase the data exchange rate.

Array Reshaping can change the bit width of the memory. By changing the bit width, more data can be transferred in one input/output.

Loop Unrolling expands the loop, increases the parallelism of reads and writes and reduces the delay of operations.

Loop Pipelining pipelines the loop so that the next operation starts before the previous operation is finished.

The data operations of the convolution neural network are carried out in the FPGA, but the RAM resources in the FPGA are often very limited. The intermediate results are cached by interacting with the external DDR. With data transmission through a high-speed data interface, the operation process is sped up.

III. ACCELERATOR DESIGN EXPLORATION

This section starts with an overview of our accelerator architecture and introduces several design challenges on the FPGA platform. In order to overcome these challenges, we propose corresponding optimization techniques.

A. Design Overview

Figure 2. Overview of accelerator design

As shown in Figure 2, the CNN accelerator design includes the ARM, the DDR, the AXI DMA and the CNN IP kernel. The ARM interacts with the outside data and writes the input stream data into the DDR, and then the DMA imports the data stream into the CNN IP core through the AXI stream protocol in MM2S (memory-mapped to stream) mode.
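As a rough sketch of the data path just described, the following C fragment shows how such a CNN IP top level could expose an AXI-Stream input for the DMA and an AXI-Lite interface for ARM control and result readout in Vivado HLS. The function name, array sizes and port mapping below are illustrative assumptions and are not taken from the paper.

    /* Hypothetical top-level interface sketch (not the authors' actual code).
       The DMA streams the image in over AXI-Stream (MM2S); the ARM starts the
       core and reads the class scores back over the AXI-Lite slave interface. */
    #define IMG_SIZE (32 * 32)   /* assumed 32x32 single-channel Lenet-5 input */
    #define OUT_SIZE 10          /* ten class scores */

    void cnn_top(float image_in[IMG_SIZE], float result_out[OUT_SIZE]) {
    #pragma HLS INTERFACE axis      port=image_in    /* fed by AXI DMA (MM2S)   */
    #pragma HLS INTERFACE s_axilite port=result_out  /* read by ARM over AXI-Lite */
    #pragma HLS INTERFACE s_axilite port=return      /* start/done control regs  */
        /* ... convolution, downsampling and full-connection layers ... */
    }

With this kind of interface the block design in Vivado only needs the DMA on the data side and an AXI-Lite connection from the processing system for control, matching the structure of Figure 2.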

The CNN IP kernel is optimized by Vivado HLS (high-level synthesis). The whole system is generated from the block diagram in Vivado, and the ARM-side operation is completed in the Vivado SDK.

B. HLS Optimization

The Xilinx Vivado High-Level Synthesis tool converts C language into a register transfer level (RTL) implementation that can be integrated into a Xilinx field programmable gate array (FPGA). The C specification can be written using the C, SystemC or Open Computing Language (OpenCL) API C kernel, and the FPGA provides a massively parallel architecture that outperforms traditional processors in terms of performance, cost and power consumption.

Its main advantages are:
1) Improve the development efficiency of hardware design.
2) Improve the system performance of software design.
3) Develop and verify the algorithm at the C language level.
4) Use optimization directives to generate multiple HDL implementations from the C language.
5) Create readable and portable C language code.

For-loop optimization: The most time-consuming part of the convolution neural network is the convolution operation, so we mainly optimize the convolution operation [7]. For three-dimensional input images, convolution operations usually have six nested loops, namely, the input layers, output layers, image rows, image columns, convolution kernel rows and convolution kernel columns. Therefore, traditional processors spend a lot of time on convolution operations [5].

The convolution operation code is as follows:

    for (i = 0; i < Depth; i++) {            /* input layers (channels)     */
      for (m = 0; m < N4; m++) {             /* output image row            */
        for (n = 0; n < N4; n++) {           /* output image column         */
          for (p = 0; p < M; p++) {          /* convolution kernel row      */
            for (q = 0; q < M; q++) {        /* convolution kernel column   */
              tp = imagein[i][m + p][n + q]; /* buffer the pixel in a register */
              for (j = 0; j < Depth1; j++) { /* output layers (channels)    */
    #pragma HLS UNROLL
                out[j][m][n] += tp * weight[i][j][p][q];
              }
            }
          }
        }
      }
    }

Here i is the index of the input layers, j is the index of the output layers, m is the output image row, n is the output image column, p is the convolution kernel row, and q is the convolution kernel column.

The inner operation is formula (1):

out[j][m][n] += imagein[i][m + p][n + q] * weight[i][j][p][q]    (1)

The input imagein is independent of j (the output layer), so the input pixel is assigned to the tp register outside the output-layer loop. The convolution operation can then run up to Depth1 times faster, as resources allow.

The convolution operation has up to six loop levels. Considering the resource capacity, the loop is unrolled (UNROLL) and the weight and out variables are reorganized appropriately [6], as shown in Figure 3.

Figure 3. Unroll loop circuit

We obtain formula (2), which shortens the run time to a quarter of the original:

out = tp * (weight1 + weight2 + weight3 + weight4)    (2)

Although the loop has been optimized, the reading of parameters after the loop is unrolled is first limited by the memory ports: only one data item can be read from a memory per clock cycle, so memory optimization is required. The main methods are Array Partitioning and Array Reshaping. After optimization, the number of memory ports can match the loop unroll factor.

C. System Setup

The high-level synthesized IP core is added to the Vivado project. After configuring the system through the ARM, the data flow is controlled by the DMA [11], and the data in the DDR is directed to the convolution neural network IP.

A DMA transfer copies data from one address space to another. When the ARM initiates this transfer, the transfer itself is carried out and completed by the DMA controller. DMA transfer is very important for high-performance embedded system algorithms and networks.

As shown in Figure 4, the data is first placed in the DDR through the ARM, then the DMA is configured to read the data stream through AXI stream; the DMA inputs the data stream into the CNN IP, and the ARM reads the recognition result of the convolution neural network through the AXI-Lite bus [4].
Figure 4. Convolution neural network control structure
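To make the control flow of Figure 4 concrete, the following bare-metal C sketch outlines one possible ARM-side sequence using the Xilinx standalone AXI DMA driver. The device ID macro, buffer size and the final result-read step are assumptions for illustration; this is not the authors' actual code, and a real project could also use interrupts instead of polling.

    #include "xaxidma.h"
    #include "xil_cache.h"
    #include "xparameters.h"
    #include "xstatus.h"

    #define IMG_BYTES (32 * 32 * sizeof(float))   /* assumed input buffer size */

    static XAxiDma AxiDma;
    static float ImageBuf[32 * 32];               /* image data placed in DDR */

    int run_cnn_once(void) {
        /* 1) Initialize the DMA engine (its registers sit on AXI-Lite).
              The device ID macro depends on the block design. */
        XAxiDma_Config *cfg = XAxiDma_LookupConfig(XPAR_AXIDMA_0_DEVICE_ID);
        if (!cfg || XAxiDma_CfgInitialize(&AxiDma, cfg) != XST_SUCCESS)
            return XST_FAILURE;

        /* 2) Write the input data stream into DDR and flush the cache
              so the DMA sees the up-to-date picture. */
        /* ... fill ImageBuf with the picture to be recognized ... */
        Xil_DCacheFlushRange((UINTPTR)ImageBuf, IMG_BYTES);

        /* 3) Start the MM2S transfer: the DMA streams the image into the CNN IP. */
        if (XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR)ImageBuf, IMG_BYTES,
                                   XAXIDMA_DMA_TO_DEVICE) != XST_SUCCESS)
            return XST_FAILURE;
        while (XAxiDma_Busy(&AxiDma, XAXIDMA_DMA_TO_DEVICE))
            ;  /* wait until the stream has been consumed */

        /* 4) Read the recognition result back over the AXI-Lite bus
              (through the driver generated by Vivado HLS for the CNN IP). */
        return XST_SUCCESS;
    }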

The main functions in the control structure are as follows:
1) Write the data stream to the DDR.
2) Configure the DMA module through AXI-Lite.
3) Read the data stream through the DMA and write it to the CNN IP.
4) Finally, read the recognition result.

IV. EVALUATION

This section first introduces our experimental environment settings and then provides the full range of experimental results.

A. Experimental Setup

The accelerator is designed with Vivado HLS (v2018.2). The tool allows the accelerator to be implemented in C language and exports RTL as a Vivado IP kernel. The CNN C code is parallelized by adding compilation directives defined by HLS, and the parallel version is verified by the tool's timing analysis. Rapid pre-synthesis simulation is accomplished by combining the tool's C simulation and C/RTL co-simulation. The pre-synthesis resource report is used for design space exploration and performance estimation. The exported RTL is synthesized and implemented using Vivado v2018.2.

Our implementation is based on a ZYBO Z7 board with a Xilinx FPGA chip zynq7020. It operates at 100 MHz, and the software implementation runs on an Intel(R) Core(TM) i5-4590 CPU at 3.3 GHz with 4 cores, on a Windows 10 system with a Dell 02YYK5 board.

B. Result

In this section, we first report resource occupancy and then compare the software (CPU) implementation with our accelerator (FPGA) implementation. Finally, the comparison between our implementation and the existing FPGA implementation is given.

Routing is provided by the Vivado toolset. The tool then reports the resource occupancy shown in Table II. We can see that our CNN accelerator has almost fully utilized the hardware resources of the FPGA. In Table III, we compare the CPU to the FPGA in detail. For the CPU, we built the Lenet-5 network in MATLAB and realized the function of recognizing handwritten numbers.

TABLE II. FPGA RESOURCE UTILIZATION

Resource     DSP      BRAM     LUT      FF
Used         125      119.5    14659    14172
Available    220      140      53200    106400
Utilization  56.82%   85.36%   27.55%   13.32%

TABLE III. PERFORMANCE COMPARISON TO CPU

Float 32bit       CPU 3.3GHz (ms)   FPGA (ms)
Layer1            9.230             2.72026
Layer2            19.513            1.97655
Layer3            13.275            3.11795
Layer4            0.201             1.21707
Layer5            0.163             0.10270
Total             42.382            9.13516
Overall GFLOPS    0.073             0.343
Speedup           1X                4.7X

We choose the 3.3 GHz CPU and the FPGA accelerator for comparison. The FPGA implementation achieves a 4.7x speedup compared with the general purpose processor implementation. With lower power consumption, the total performance of our accelerator reaches 0.343 GFLOPS.

V. CONCLUSION

In this paper, we proposed a CNN FPGA acceleration method based on the Lenet-5 model on the FPGA platform ZYBO Z7. First, the calculation and memory access of the CNN are optimized, and then all possible problems are modeled under the Lenet-5 model to find the optimal solution for each layer. We find the best optimization design by comparing the results of many experiments. Finally, the ZYBO Z7 board achieved a low power of 1.8 W with a data throughput of 0.343 GFLOPS, which is much better than on a traditional processor. Due to its unique structure and computing ability, the FPGA has great potential for low power consumption, and the development of the FPGA has a wide prospect for cases that satisfy the load requirement.

REFERENCES

[1] D. Aysegul, J. Jonghoon, G. Vinayak, K. Bharadwaj, C. Alfredo, M. Berin, and C. Eugenio. Accelerating deep neural networks on mobile processor with embedded programmable logic. In NIPS 2013. IEEE, 2013.
[2] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf. A programmable parallel accelerator for learning and classification. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2010, pp. 273-284.
[3] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi. A dynamically configurable coprocessor for convolutional neural networks. In ACM SIGARCH Computer Architecture News, vol. 38. ACM, 2010, pp. 247-257.
[4] C. Zhang, P. Li, G. Sun, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161-170.
[5] J. Cong and B. Xiao. Minimizing computation in convolutional neural networks. In Artificial Neural Networks and Machine Learning, ICANN 2014. Springer, 2014, pp. 281-290.
[6] Y. Ma, Y. Cao, S. Vrudhula, et al. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 45-54.
[7] J. Qiu, J. Wang, S. Yao, et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26-35.
[8] C. Alippi, S. Disabato, et al. Moving convolutional neural networks to embedded systems: the AlexNet and VGG-16 case. In ACM/IEEE International Conference on Information Processing in Sensor Networks, 2018, pp. 212-223.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097-1105.
[10] N. Ketkar. Convolutional Neural Networks. 2017.
[11] D. Kim, R. Managuli, and Y. Kim. Data cache and direct memory access in programming mediaprocessors. IEEE Micro, vol. 21, no. 4, 2001, pp. 33-42.
[12] Xilinx. Vivado Design Suite User Guide: High-Level Synthesis, UG902, 2017.

