
2020 5th International Conference on Computational Intelligence and Applications (ICCIA)

A CNN Accelerator on FPGA with a Flexible Structure

Dan Shan
Department of Electronic Engineering
Dalian Neusoft University of Information
Dalian, P. R. China
e-mail: [email protected]

Guotao Cong
Department of Electronic Engineering
Dalian Neusoft University of Information
Dalian, P. R. China
e-mail: [email protected]

Wei Lu
School of Control Science and Engineering
Dalian University of Technology
Dalian, P. R. China
e-mail: [email protected]

Abstract—Most of the existing convolutional neural networks (CNNs) are based on PC software, which cannot meet the real-time, low-power and miniaturization requirements of the systems. In this paper, a CNN accelerator with a flexible structure based on a Field-Programmable Gate Array (FPGA) is proposed to achieve recognition of MNIST handwritten numeric characters. The system adopts deep pipeline processing and optimizes inter-layer and intra-layer parallelism at two levels of coarse and fine granularity. In view of the similarity of convolution structures, the design adopts a structured circuit, which can easily expand the number of layers and neurons. The classification throughput and inter-layer data throughput can be improved by rationally organizing the internal memory resources of the FPGA. Compared with a general CPU, it achieves a 3 times acceleration at a 50 MHz frequency, while the power consumption is only 2% of the CPU. Finally, performance and power consumption are compared with other accelerators on VGG16.

Keywords—convolutional neural networks, low power, acceleration, field programmable gate array, parallelism optimization

I. INTRODUCTION

With the development of integrated circuits and the improvement of computer processing ability, the neural network has become a hotspot [1] as a new field of artificial intelligence and machine learning. Furthermore, CNN is a deep learning algorithm which has been applied in machine vision [2-3], pattern recognition [4] and other fields because of its high adaptability to image translation, scaling, tilting, etc. It is a sensitive extractor of graphical features, avoiding the complex feature extraction and data reconstruction process of traditional recognition algorithms [5]. At present, the traditional implementation of CNN mostly depends on general-purpose processor platforms running software. Consequently its performance, power consumption, size and data throughput need to be improved urgently. Considering that the CNN structure has high parallelism, the characteristics of general-purpose processors for logic processing and transaction processing are not suitable for mining the inherent parallelism of CNN, so this paper develops a CNN accelerator based on FPGA, which makes full use of abundant logic units, wiring resources and the advantages of parallel processing to pursue high speed, low power consumption and miniaturization of the system.

By analyzing the CNN architecture, this paper exploits single-layer and inter-layer parallelism at both coarse-grained and fine-grained levels [6]. The developed accelerator has been implemented on an Artix-7 FPGA; it uses internal resources fully, adopts a deep pipeline structure, carries out parallel optimization, and converts floating-point to fixed-point with curve fitting of the activation function to improve operational efficiency. Finally, the MNIST handwritten dataset [7] is used as the recognition object.

Most of the CNN accelerators implemented on FPGA focus on the limitations of memory bandwidth and parallelism. Considering the memory bandwidth, on-chip memory alone is unachievable for deeper CNNs with numerous parameters [8-12]. In contrast, [10-12] stored parameters in off-chip memory at the expense of precision. References [13] and [4] optimize the parallelism of computation. Reference [14] implemented CNN using High Level Synthesis. OpenCL frameworks for CNN are evaluated in [15].

Given this situation, the key contributions of this brief are: (1) A high-performance CNN hardware accelerator framework is proposed in which parallelism optimization is achieved both at the single-layer and inter-layer levels, from coarse and fine granularity. (2) Flexible on-chip memory management is adopted to maximize on-chip resource utilization by adjusting the parallelism and pipeline depth according to the internal resource characteristics of different FPGAs. (3) With a structured design, the convolution layer and pooling layer adopt a standard interface, which makes CNN reconstruction and scale expansion easy. (4) By applying the proposed framework and methods, deep CNNs can be implemented on common FPGAs. This represents a 3 times speedup and 2% power consumption compared with a CPU.

This brief is organized as follows. Section II describes the basic CNN and the improved CNN structure. Section III describes the system design: the hierarchical architecture implementation, the optimization of structural scalability, the on-chip memory organization, parallel processing, and floating-point to fixed-point conversion. Section IV illustrates the experimental results. Finally, the conclusion is given in Section V.

II. THE IMPROVED CNN

A. Basic CNN

CNN is a very typical forward-propagation neural network, especially suitable for two-dimensional data processing scenarios. It includes an input layer, an output layer and hidden layers. The number of neurons in the input layer is related to the dimension of the input data; the number of neurons in the output layer, usually fully connected, is the same as the number of classes to be identified; the hidden layers can be single or multiple, usually composed of alternating convolution and pooling layers. The convolution layer convolves the corresponding area of the image with a kernel and obtains a value through the non-linear transformation of the activation function. It then moves the convolution window continuously to carry out the convolution and activation operations, and consequently completes the convolution of the whole image. The convolution kernel, i.e. the size of the receptive field, is usually chosen as 5×5 or 3×3. The pooling layer is located between convolution layers, or between a convolution layer and the output layer, and is used to compress the scale of neurons and weights step by step. Through a sliding pooling window, the data in the window is compressed to a single value, and the size and stride of the sliding window determine the size of the output feature map. The features of each layer of a CNN are obtained from a shared weighted kernel applied to a local rather than global region of the previous layer, which reduces the number of weights and the training difficulty, as shown in formula (1), where $m$ and $n$ denote the number of input and output feature maps respectively, $x_j^l$ denotes the $j$th feature map of layer $l$, $f$ is the activation function, and $k_{ij}$ and $b_j$ denote the weights and biases of the corresponding layers, respectively:

$c_j^l = f\left(\sum_{i=1}^{m} x_i^{l-1} * k_{ij} + b_j\right), \quad j = 1, 2, \cdots, n$   (1)

The forward-propagating signal is processed by each layer of neurons and finally reaches the output layer. After the output error is calculated, the residual error of each layer satisfies a recurrence relationship, so the error is transmitted back layer by layer in reverse. According to the influence of each weight on the error, the weight value is adjusted, and this process of forward transmission and reverse adjustment of the weights is repeated over the training samples. Finally, appropriate weights are obtained, and the recognition error of the feed-forward network is controlled within a certain range.

B. Optimized CNN Structure

As a classical multi-layer artificial neural network structure, LeNet-5 was proposed by LeCun et al. [16] as early as the 1990s. In order to improve the training and recognition speed while ensuring the recognition accuracy, the traditional LeNet-5 model was optimized and the algorithm was improved [17]: part of the fully connected layers was deleted. The optimized basic CNN structure adopted in this paper, shown in Fig. 1, is similar to the DeepLearn Toolbox in Matlab. It differs from LeNet-5 in several aspects, including removing the ring of black edges around the image, increasing the number of convolution kernels in the C3 layer, and replacing the C5-F6-output network with a full connection from S4 directly to the output. The structure includes the input layer, first convolution layer C1, first pooling layer S2, second convolution layer C3, second pooling layer S4 and fully connected output layer O5.

Figure 1. Optimized CNN structure diagram. [28×28 input → C1: 6 kernels 5×5 → 6@24×24 → S2 → 6@12×12 → C3: 6×12 kernels 5×5 → 12@8×8 → S4 → 12@4×4 → 192-vector → O5 → 10 outputs.]

Assume the input gray-level image resolution is 28×28. In the C1 layer, convolution operations are performed with 6 5×5 convolution kernels with a stride of 1; thus, 6 24×24 convolution images are obtained. After a bias is added to each pixel, the 6 24×24 output feature maps of the layer are obtained through the activation function, which clamps the data within a certain range and introduces non-linear factors into the network. A commonly used activation function such as the Sigmoid function is chosen. There are 24×24×6 = 3456 neurons, 5×5×6 = 150 weights, 1×6 = 6 biases and (5×5+1)×24×24×6 = 89 856 connections in this layer.

The S2 layer is the pooling or down-sampling layer. Here, mean pooling is used to avoid losing the information of neurons. The sampling template size is 2×2 and the sampling stride is 2, so each 2×2 area of the input image is sampled without overlap with adjacent areas. This layer outputs 6 12×12 feature images and generates 12×12×6 = 864 neurons.

The C3 layer is similar to the C1 layer. Each of the 6 12×12 feature images is convolved with 12 5×5 convolution kernels with a stride of 1; thus, 6×12 8×8 convolution images are obtained, organized as 12 groups of 6. The 6 corresponding pixels in each group are added in turn, and after the bias is added, the 12 8×8 output feature images are obtained through the activation function. There are 8×8×12 = 768 neurons, 8×8×5×5×12×6 = 115 200 weights, 1×12 = 12 biases, and 115 200 + 8×8×12 = 115 968 connections in this layer.

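To make formula (1) and the layer arithmetic above concrete, the following sketch walks the optimized network's shapes from the 28×28 input down to the 192-element vector fed to O5. It is an illustration we add here, not the paper's RTL: it uses NumPy, and the kernel and bias values are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(img, k):
    """'Valid' 2-D convolution with stride 1 (cross-correlation form)."""
    H, W = img.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * k)
    return out

def conv_layer(x, k, b):
    """Formula (1): c_j = f(sum_i x_i * k_ij + b_j) for j = 1..n."""
    m, n = len(x), k.shape[1]           # input / output feature maps
    return [sigmoid(sum(conv2d_valid(x[i], k[i, j]) for i in range(m)) + b[j])
            for j in range(n)]

def mean_pool2x2(x):
    """2x2 mean pooling, stride 2 (in hardware: sum of 4 pixels >> 2)."""
    return [(f[0::2, 0::2] + f[0::2, 1::2] + f[1::2, 0::2] + f[1::2, 1::2]) / 4
            for f in x]

rng = np.random.default_rng(0)
img = rng.random((28, 28))
c1 = conv_layer([img], rng.standard_normal((1, 6, 5, 5)), np.zeros(6))   # 6 @ 24x24
s2 = mean_pool2x2(c1)                                                    # 6 @ 12x12
c3 = conv_layer(s2, rng.standard_normal((6, 12, 5, 5)), np.zeros(12))   # 12 @ 8x8
s4 = mean_pool2x2(c3)                                                    # 12 @ 4x4
vec = np.concatenate([f.ravel() for f in s4])                           # 192 neurons to O5
assert [f.shape for f in c1] == [(24, 24)] * 6 and vec.size == 192
```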
Layer S4 is similar to the S2 layer: the size of the sampling window is 2×2 and the stride is 2. The layer outputs 12 4×4 feature images and generates 4×4×12 = 192 neurons. In the O5 layer, the 192 neurons output from S4 are expanded into a one-dimensional vector, which is the input of the O5 layer, and the output is a fully connected single-layer neural network of 10 neurons. There are 192×10 = 1920 connections in this layer.

III. SYSTEM DESIGN

A. Hierarchical Implementation Architecture

Considering the forward computing characteristics of the convolution layers, pooling layers and output layer in CNN, and taking the internal circuit resources of the FPGA into account, the layers C1, S2, C3, S4 and O5, and the seamless connections between layers, are designed as shown in Figs. 2 and 3.

Figure 2. C1 layer computing structure. [Two-dimensional 28×28 images enter the accelerator in one-dimensional 784×1 form through an asynchronous clock-domain FIFO buffer; 5 Shift_RAMs and a 5×5 register window feed 6 parallel kernel channels, each followed by a bias (bias0 to bias5) and a Sigmoid unit.]

For the C1 layer, as shown in Fig. 2, in order to save FPGA ports, the input data is converted from a 2-D image of 28×28 to a 1-D image of 784×1, and each pixel value is input sequentially. In the C1 layer, a FIFO is used to buffer the input image and to bridge the asynchronism between the 50 MHz external clock and the 75 MHz internal clock. After image caching, to facilitate the pipelined convolution operation, five rows of data are output simultaneously through 5 on-chip shift RAMs, and a 5×5 register array is then filled to form a 5×5 image window for the convolution operation. After convolution with the 6 5×5 convolution kernels, the corresponding bias is added and the result passes through the activation function. The output neuron structure of this layer is a 24×24 feature image, and when the convolution window moves between rows there is a gap of 5 clock cycles. In order to keep the circuit structure clear and easy to implement, the outputs of the activation function are stored in 24 single-line FIFOs separately.
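The row-buffer scheme of Fig. 2 can be modeled in software. The sketch below is our behavioral illustration, not the paper's RTL: it mimics the 5 shift RAMs and the 5×5 register window, with pixels arriving one per cycle in 1-D order and a valid window produced once 4 full rows plus 5 pixels of the fifth row have been buffered.

```python
def sliding_windows(pixels, width=28, k=5):
    """Stream 1-D pixels; emit each 5x5 window as k circular row buffers
    (the shift RAMs) plus a k-wide register array would produce it."""
    rows = [[0] * width for _ in range(k)]   # k line buffers
    for n, p in enumerate(pixels):
        r, c = divmod(n, width)              # position of the incoming pixel
        rows[r % k][c] = p                   # overwrite the oldest row in place
        if r >= k - 1 and c >= k - 1:        # window top-left is (r-4, c-4)
            yield [[rows[(r - k + 1 + i) % k][c - k + 1 + j]
                    for j in range(k)] for i in range(k)]

img = list(range(28 * 28))                   # stand-in pixel stream
wins = list(sliding_windows(img))
assert len(wins) == 24 * 24                  # one window per C1 output pixel
assert wins[0][0][0] == 0 and wins[0][4][4] == 4 * 28 + 4
```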
For the S2 layer, the pooling window is 2×2. As shown in Fig. 3, the 24 rows of FIFO read data are grouped into 12 groups of 2 rows, and each group enters a 2×2 register array to realize the pooling operation. Average pooling is adopted, that is, the average of the four elements in the pooling window is taken as the pooling result; in hardware, the sum of the four pixels is shifted 2 bits to the right to obtain the pooling output. The output of this layer is a 12×12 feature image per channel, and the implementation is similar to that of the C1 layer. Here, 12 rows of data are cached in 12 single-line FIFOs, and the 12 rows are output one by one through a row selector as 1-D image data.

Figure 3. S2 layer computing structure.
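As a small worked example we add here (not from the paper), the shift-based mean pooling is exactly a truncating divide-by-4 of the window sum, which is why it costs no multiplier or divider:

```python
def pool2x2_hw(a, b, c, d):
    """Average of four fixed-point pixels as the hardware computes it:
    sum the window, then shift right by 2 (floor of the divide-by-4)."""
    return (a + b + c + d) >> 2

# Four samples on the 4096x scale of Section III-E (4096 represents 1.0).
assert pool2x2_hw(4096, 4096, 4096, 4096) == 4096   # avg(1,1,1,1) = 1.0
assert pool2x2_hw(4096, 0, 0, 0) == 1024            # avg = 0.25
```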
Layers C3 and S4 are similar to C1 and S2 respectively. Layer C3 caches 8 rows in FIFOs for each of the 12 feature maps. Layer S4 produces 4×4 feature images: 4 rows of data are cached in 4 single-line FIFOs, and the 4 rows are output one by one through a row selector and converted into 1-D image data. Over the 12 channels there are 12×4×4 = 192 neurons.

The O5 layer is the fully connected output layer. The 192 neurons from the S4 layer are input, and the 10 classification results are generated with the weight values. Each output is connected to 192 weights, which are stored in ROM as initial values, with 10 ROMs used in total. The intermediate classification results are obtained by multiplying and summing the 192 neurons for each output. Finally, the output results are traversed once to find the maximum value, which gives the final classification result.
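A behavioral model of O5 is a 192-input multiply-accumulate per output followed by a single argmax scan. The sketch below is our illustration, not the paper's circuit; the weight values are made up, standing in for the 10 ROM initialization tables.

```python
def o5_layer(neurons, weights):
    """Fully connected output layer: 10 dot products over the 192 S4
    neurons, then one traversal to find the index of the maximum score."""
    assert len(neurons) == 192 and len(weights) == 10
    scores = [sum(n * w for n, w in zip(neurons, wrow)) for wrow in weights]
    best = 0
    for j in range(1, 10):               # traverse once, keep the running max
        if scores[j] > scores[best]:
            best = j
    return best                          # predicted digit 0..9

# Toy usage with placeholder fixed-point values.
neurons = [(i * 37) % 4096 for i in range(192)]
weights = [[((i + j) % 7) - 3 for i in range(192)] for j in range(10)]
print(o5_layer(neurons, weights))
```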
B. Structural Scalability Optimization

The CNN is implemented as a modular architecture. Both the convolution layer and the pooling layer adopt standardized interfaces, and inter-layer connections only need to meet the standard interface and timing requirements. This makes it convenient to expand both the overall hierarchy of the CNN and the number of inter-layer neurons. At the circuit implementation level, CNN architectures of different scales can be accomplished simply by replicating modules and the parallel structures within modules. For example, the convolution layer input is 1-D image information from the upper-left corner of the image to the lower-right corner, and its output feature map is row-parallel; the pooling layer input is row-parallel graphic information, and its output is 1-D image information from the upper-left corner of the image to the lower-right corner. In this way, the output of a convolution layer can be connected directly to a pooling layer as its input, and the output of a pooling layer can likewise be connected directly to a convolution layer, realizing a seamless connection. When the overall hierarchy of the CNN needs to be extended, the standard layers can be replicated and reused conveniently without additional design, as the sketch below illustrates.
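The following sketch is a software analogy we add for the standardized-interface idea, in Python rather than HDL: every stage consumes and produces the same stream type, so layers can be chained to any depth without redesign. The names (`Stage`, `conv_stage`, `pool_stage`) are hypothetical placeholders, not the paper's modules.

```python
from typing import Callable, Iterable, List

# One contract for every layer: a stream of rows in, a stream of rows out.
Stage = Callable[[Iterable[List[int]]], Iterable[List[int]]]

def conv_stage(rows):                # placeholder convolution stage
    for row in rows:
        yield [v + 1 for v in row]

def pool_stage(rows):                # placeholder pooling stage
    for row in rows:
        yield row[0::2]              # keep every other column

def chain(stages: List[Stage], stream):
    """Seamless connection: the output of each stage feeds the next."""
    for s in stages:
        stream = s(stream)
    return stream

# Extending the hierarchy is just replicating standard stages.
net = [conv_stage, pool_stage, conv_stage, pool_stage]
out = list(chain(net, [[0, 1, 2, 3]] * 4))
```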
C. Optimizing On-chip Memory Usage

There are memory resources in the FPGA which can be used as RAM or ROM. Generally, CNN weights can be stored in on-chip memory resources or in an off-chip SDRAM or Flash medium, but in terms of the speed of weight acquisition, on-chip memory has obvious advantages. Here, in order to save on-chip memory resources, some weights are stored in on-chip memory while the other weights are involved in the calculation directly in the form of parameters. The saved resources are used to cache neuron data in the convolution and pooling layers, which avoids the read-write time of off-chip SDRAM and further improves the CNN computing speed.
D. Parallelism Optimization

Considering the internal resources of the FPGA, the parallelism of the design is optimized at two levels, coarse-grained and fine-grained. At the coarse-grained level, there is no feedback crossing between the forward levels of the CNN, so a pipeline structure can be realized in convolution, pooling and inter-layer transmission; the convolution layer can achieve up to 5 pipeline stages and the pooling layer up to 12. At the fine-grained level, the circuit structures of the different channels within a layer are parallelized to process the different channels' data in parallel.
E. Floating-point to Fixed-point Optimization

The image information in the MNIST data set is floating-point in the interval [0,1], and if floating-point processing were carried out directly in the FPGA, a lot of logic resources would be consumed. So the input data of the MNIST data set is converted to fixed-point and then sent to the FPGA for further calculation. The conversion rule is to enlarge the floating-point value by 4096 times and store it in 16 bits. After normalization, the weights are likewise enlarged 4096 times and participate in the internal circuit operation of the FPGA in the form of parameters or ROM initial values. When a convolution operation multiplies image data by a weight, the product is enlarged by 4096×4096 = 2^24 times. In order to keep the scaling consistent before and after the multiplication, the multiplier result is shifted 12 bits to the right before participating in the subsequent calculation.
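The scaling rule can be checked in a few lines. The paper gives only the rule, not code; this sketch is ours. A real value v is stored as round(v×4096) in 16 bits, and a product of two such values is rescaled with a 12-bit right shift:

```python
SCALE = 4096                         # 2**12, so 4096 represents 1.0

def to_fixed(v: float) -> int:
    """Quantize to the paper's 16-bit fixed-point: enlarge 4096x."""
    return int(round(v * SCALE))

def fixed_mul(a: int, b: int) -> int:
    """A product of two 4096x values is scaled by 4096*4096 = 2**24;
    shift right 12 bits to bring it back to the common 4096x scale."""
    return (a * b) >> 12

x, w = to_fixed(0.5), to_fixed(0.25)
y = fixed_mul(x, w)                  # expect 0.125 * 4096 = 512
assert y == 512
```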
F. Segmental Fitting Optimization of the Activation Function

The activation function is the key to introducing non-linearity into the CNN structure. The most common way to realize it is a look-up table, which needs a large amount of memory to ensure accuracy. Here, in order to save memory resources, especially on-chip memory, while ensuring accuracy, curve fitting is adopted. The Sigmoid function is fitted piecewise [1] as shown in Table I.

TABLE I. CURVE FITTING OF THE SIGMOID FUNCTION IN FPGA

Interval | Fitting function
[0, 2048] | f(x) = 1004x/4096 + 2050
[2049, 10035] | f(x) = -184x^2/4096/4096 + 1167x/4096 + 2012
[10036, 16384] | f(x) = -71x^2/4096/4096 + 615x/4096 + 2695
[16385, 22938] | f(x) = -17x^2/4096/4096 + 198x/4096 + 3500
[22939, 30720] | f(x) = 6x/4096 + 4050
[30721, 32767] | f(x) = 4096

According to the symmetry of the Sigmoid function, when the input data is in the range [32768, 65535], the output value is obtained according to formula (3).

$\operatorname{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$   (2)

$f(x) = 4096 - f(65536 - x), \quad 32768 \le x \le 65535$   (3)

In the process of curve fitting, multiplication and division operations are needed. Multiplication is realized by a signed 16-bit fixed-point hard-core multiplier, and division by 4096 is realized by shifting registers 12 bits to the right.
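Table I and formula (3) translate directly into a branch-and-multiply-and-shift evaluation. The sketch below is our illustration, using Python integers in place of the 16-bit datapath and assuming the Table I interval bounds join contiguously; it tracks the true Sigmoid to within a few counts on the 4096x scale.

```python
import math

def sigmoid_fit(x: int) -> int:
    """Sigmoid on the 4096x fixed-point scale, input x in [0, 65535].
    [0, 32767] encodes non-negative arguments; [32768, 65535] encodes
    negative ones, folded back by the symmetry of formula (3)."""
    if x >= 32768:
        return 4096 - sigmoid_fit(65536 - x)         # formula (3)
    if x <= 2048:
        return (1004 * x >> 12) + 2050
    if x <= 10035:
        return (-184 * x * x >> 24) + (1167 * x >> 12) + 2012
    if x <= 16384:
        return (-71 * x * x >> 24) + (615 * x >> 12) + 2695
    if x <= 22938:
        return (-17 * x * x >> 24) + (198 * x >> 12) + 3500
    if x <= 30720:
        return (6 * x >> 12) + 4050
    return 4096                                      # saturated region

for v in (0, 1024, 4096, 12288, 65536 - 4096):
    real = v / 4096 if v < 32768 else (v - 65536) / 4096
    print(v, sigmoid_fit(v), round(4096 / (1 + math.exp(-real))))
```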
IV. EXPERIMENTAL RESULTS AND ANALYSIS

The training of the CNN is completed on a PC, and the trained weights are saved. The trained weights are fixed and loaded into the FPGA for use by the feed-forward network. The FPGA circuit is implemented on an AX301 board whose device is the XC7A200TSBG484 of the Artix-7 series, and the development environment is Vivado 2018, which has its own simulator for simulation. The MNIST data set is selected as the digit recognition library; it has 60,000 images in the training set and 10,000 images in the test set. Each image is a 28×28 pixel gray-scale image with floating-point gray values in [0,1]. Each image is converted to fixed-point and sent to the FPGA.

TABLE II. COMPUTATION SPEED, POWER CONSUMPTION AND RECOGNITION RATE COMPARISON

Images | FPGA time/ms | FPGA speed/(GMAC/s) | CPU time/ms | CPU speed/(GMAC/s) | Speedup ratio
1 | 0.526 | 0.598 | 1.92 | 0.131 | 3.65

TABLE III. USAGE OF FPGA RESOURCES

Module | Used resources | Rate/%
LUT | 88 756 | 66
LUTRAM | 596 | 1
FF | 42 038 | 16
DSPs | 571 | 77
BRAM | 218 | 60

Compared with the software implementation on the CPU, the speed and power consumption of the FPGA implementation are improved as shown in Table II. The CPU is an Intel Core i7, and the software development tool is Visual Studio 2017. The utilization of FPGA resources is shown in Table III. The FPGA achieves 0.598 GMAC/s of multiply-accumulate. Compared with the general CPU, the FPGA is more than 3 times faster, while the power consumption is 1.225 W for the FPGA against 78 W for the CPU; that is, the power consumption is reduced to about 1.57%. Fig. 4 shows the simulation waveform of predicting the classification of a single image "9".
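The ratios quoted above follow directly from the measured times and powers; a quick check we add of the 3.65x speedup and the ~1.57% power figure:

```python
fpga_ms, cpu_ms = 0.526, 1.92        # per-image latency from Table II
fpga_w, cpu_w = 1.225, 78.0          # measured power draw

print(f"speedup   : {cpu_ms / fpga_ms:.2f}x")        # -> 3.65x
print(f"power     : {100 * fpga_w / cpu_w:.2f}%")    # -> 1.57%
print(f"energy/img: {fpga_w * fpga_ms / (cpu_w * cpu_ms):.4f}")  # ~0.43%
```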

Figure 4. Simulation waveform of the CNN architecture for the digit "9".

Further, in order to compare performance and power consumption with other accelerators, the flexible configurability of this design is fully utilized and a common VGG16 network structure is evaluated. Image data and weight data are all 8-bit fixed-point numbers. The comparison results are shown in Table IV. The calculation speed is significantly improved compared with the CPU, while the power consumption is significantly reduced compared with the GPU.

TABLE IV. ACCELERATOR PERFORMANCE AND POWER CONSUMPTION COMPARISON

Algorithm | Platform | Precision | Power/W | Performance/(GOP/s) | Energy efficiency/(GOP/(s·W))
Ref. [14] | Virtex-7 VX485T | 32-bit fixed | 18.61 | 61.6 | 3.31
Ref. [15] | Virtex-7 VX690T | 16-bit fixed | 30.20 | 565.9 | 18.73
Ref. [18] | Zynq XC7Z020 | 8-bit fixed | 3.50 | 84.3 | 24.08
Ref. [19] | Arria 10 GX1150 | 16-bit fixed | 37.46 | 1790.0 | 47.78
Ref. [20]-CPU | i7-6850K | 8-bit fixed | 140.00 | 63.0 | 0.45
Ref. [20]-GPU | GTX 1080Ti | 8-bit fixed | 250.00 | 2839.4 | 11.36
Proposed | Artix-7 XC7A200 | 8-bit fixed | 14.13 | 164.1 | 11.60
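The energy-efficiency column of Table IV is simply performance divided by power; recomputing it (our check, not the paper's) matches every row to within rounding:

```python
rows = {                        # platform: (power W, performance GOP/s)
    "Ref.[14]": (18.61, 61.6),   "Ref.[15]": (30.20, 565.9),
    "Ref.[18]": (3.50, 84.3),    "Ref.[19]": (37.46, 1790.0),
    "[20]-CPU": (140.0, 63.0),   "[20]-GPU": (250.0, 2839.4),
    "Proposed": (14.13, 164.1),
}
for name, (w, gops) in rows.items():
    print(f"{name}: {gops / w:.2f} GOP/(s*W)")   # e.g. Proposed -> 11.6
```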
V. CONCLUSION

This system makes full use of the high-speed parallel processing ability and low-power characteristics of the FPGA to implement a convolutional neural feed-forward network accelerator. It fully exploits the characteristics of the CNN structure, utilizes the FPGA's on-chip memory resources effectively, improves circuit parallelism at both the coarse-grained and fine-grained levels, and facilitates network scale expansion through a flexible interfaced architecture. The performance and power consumption of the feed-forward network circuit are tested with the MNIST data set, which shows that the accelerator is 3 times faster than a general CPU while its power consumption is reduced to about 2%. Further, aided by the flexible interfaced modules, a VGG16 network is conveniently organized to realize comparison with other accelerators, and it also shows good performance.

REFERENCES

[1] Yuxi Zhang, Hanying Yang, and Yaotian Zhang, "Implementation of sigmoid function based on FPGA", Computer Engineering and Application, vol. 52(S1), pp. 501-504, 2016.
[2] Shuiwang Ji, Wei Xu, and Ming Yang, "3D convolutional neural networks for human action recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35(1), pp. 221-231, 2013.
[3] C. Farabet, C. Poulet, J. Y. Han, et al., "CNP: An FPGA-based processor for convolutional networks", in Proc. International Conference on Field Programmable Logic and Applications, Washington D.C., USA: IEEE Press, pp. 32-37, 2009.
[4] Y. LeCun, L. Bottou, Y. Bengio, et al., "Gradient-based learning applied to document recognition", Proceedings of the IEEE, vol. 86(11), pp. 2278-2324, 1998.
[5] Zijian Yu, De Ma, Xiaolang Yan, et al., "Convolutional neural network accelerator based on FPGA", Computer Engineering, vol. 43(1), pp. 109-114, 2017.
[6] S. Chakradhar, M. Sankaradas, V. Jakkula, et al., "A dynamically configurable coprocessor for convolutional neural networks", ACM SIGARCH Computer Architecture News, vol. 38(3), pp. 247-257, 2010.
[7] Y. LeCun and C. Cortes, "MNIST dataset" [EB/OL], http://yann.lecun.com/exdb/mnist/.
[8] Y. Chen et al., "DaDianNao: A machine-learning supercomputer", in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit. (MICRO), pp. 609-622, 2014.
[9] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor", ACM SIGARCH Comput. Archit. News, vol. 43(3), pp. 92-104, 2015.
[10] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs", in Proc. 54th ACM/EDAC/IEEE Design Autom. Conf. (DAC), Austin, TX, USA, pp. 1-6, 2017.
[11] S. I. Venieris and C.-S. Bouganis, "fpgaConvNet: Automated mapping of convolutional neural networks on FPGAs", in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), pp. 291-292, 2017.
[12] H. Li et al., "A high performance FPGA-based accelerator for large-scale convolutional neural networks", in Proc. 26th Int. Conf. Field Program. Logic Appl. (FPL), pp. 1-9, 2016.
[13] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks", in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), pp. 45-54, 2017.
[14] C. Zhang, P. Li, and G. Y. Sun, "Optimizing FPGA-based accelerator design for deep convolutional neural networks", in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), Monterey, Feb. 22-24, 2015. New York: ACM, pp. 161-170, 2015.
[15] H. Li, X. Fan, L. Jiao, et al., "A high performance FPGA-based accelerator for large-scale convolutional neural networks", in Proc. 26th Int. Conf. Field Programmable Logic and Applications (FPL), Lausanne, Aug. 29-Sep. 2, 2016. Piscataway: IEEE, pp. 1-9, 2016.
[16] Shanghai Wang, Xinxing Jing, and Haiyan Yang, "Study of isolated speech recognition based on deep learning neural network", Application Research of Computers, vol. 34(8), pp. 2289-2291, 2015.
[17] Shanghai Wang, Xinxing Jing, and Haiyan Yang, "Study of isolated speech recognition based on deep learning neural network", Application Research of Computers, vol. 34(8), pp. 2289-2291, 2015.
[18] K. Y. Guo, L. Z. Sui, J. T. Qiu, et al., "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37(1), pp. 35-47, 2018.
[19] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network", in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), Monterey, Feb. 22-24, 2017. New York: ACM, pp. 25-34, 2017.
[20] Bingjian Li, Guoxuan Qin, Shaojie Zhu, and Zhihui Pei, "Design of FPGA accelerator architecture for convolutional neural network", Journal of Frontiers of Computer Science and Technology, vol. 14(3), pp. 437-448, 2020.
