
IEEE ITOEC (ISSN: 2693-289X)

Acceleration and optimization of artificial intelligence CNN image recognition based on FPGA

Peipei Yan1, Zhiwei Xiang2
1. State Grid Xinjiang Electric Power Research Institute, Urumqi, China
2. State Grid Xinjiang Information and Communication Company, Urumqi, China

2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC) | 978-1-6654-3185-9/22/$31.00 ©2022 IEEE | DOI: 10.1109/ITOEC53115.2022.9734423

[email protected], [email protected]
Corresponding Author: Peipei Yan, Email: [email protected]

Abstract—Before CNN, image recognition methods mainly relied on hand-designed features, which can only represent the medium- and low-level information in an image; it is difficult for them to extract the deep-level information. CNN simulates the human brain by building a deep neural network to analyze, learn and interpret data. It has strong expressive and generalization ability and can better represent the deep-seated information of images. At present, CNN is widely used in many scenarios, including image classification, speech recognition, video analysis and document analysis. Most existing artificial intelligence image recognition systems are implemented in software and are often accelerated by GPU. However, the GPU has high power consumption and is not well suited to CNN inference, so it cannot exploit the parallel computing potential of CNN. To solve these problems, this paper studies the acceleration and optimization of CNN image recognition based on FPGA, and designs and optimizes the CNN forward model on an Intel FPGA board. The experimental results show that the FPGA has low power consumption: the CPU consumes 2.1 times and the GPU 6.5 times as much power. The average recognition time of the algorithm is 50% shorter than that of LeNet-5, AlexNet and VGG16, and the recognition time for 10000 sample pictures is 165 μs, compared with the 426.6 μs required by a DSP-based system. Compared with methods proposed in the related literature in recent years, the proposed method has higher throughput and computational performance.

Keywords—CNN; FPGA; Artificial intelligence; Image recognition; Data flow optimization

I. INTRODUCTION

At present, the following deep learning models are mainly used in artificial intelligence image recognition: deep belief network (DBN) [1], convolutional neural network (CNN) [2], recurrent neural network (RNN) [3], generative adversarial network (GAN) [4] and capsule network (CapsNet) [5]. As one of the common technologies in the field of computer vision, CNN has been applied in many fields, such as image recognition, speech recognition, data mining and intelligent robotics.

Although CNN can meet the functional needs of image recognition, at present image recognition is mostly implemented in software, which cannot reflect the advantage of a neural network's parallel processing, and the limited computing power of the CPU or GPU limits the complexity of the network structure [6]. In order to improve the computational performance of CNN, this paper proposes a CNN image recognition acceleration and optimization algorithm based on FPGA, which realizes CNN acceleration by introducing a heterogeneous development environment based on the FPGA. The experimental results show that, compared with CPU, GPU, ASIC and DSP implementations, the recognition accuracy of the FPGA-based CNN image recognition algorithm is higher than that of LeNet-5 and AlexNet on different data sets and on par with that of VGG16, while the average recognition time is 50% shorter than that of the three network models. The FPGA has low power consumption: the CPU consumes 2.1 times and the GPU 6.5 times as much power. Compared with an image recognition system based on a DSP chip, the recognition time is reduced by about 60% on the same 1024-item data set.

II. IMAGE RECOGNITION DESIGN SCHEME

A. Image recognition architecture

Image preprocessing in traditional image recognition technology enhances the image features, segments the image according to the feature differences, separates the target to be recognized from the whole image, and then extracts features from the separated target.

CNN is a deep learning neural network model that calculates the output value through forward propagation and adjusts the connection weights through back propagation. Sparse connection and weight sharing are its two most important characteristics: the neurons of adjacent layers are no longer fully but only partially connected, that is, the receptive field of a neuron is affected only by some neurons in the layer above. Such a network structure reduces the complexity of the model. Figure 1 shows the basic structure of CNN.

CNN usually includes convolution layers, pooling layers and fully connected layers. Firstly, the input image is preprocessed and fed into the neural network model; secondly, multiple convolution (conv) and pooling operations are performed on the input image to obtain increasingly complex feature maps, which then enter the fully connected layer; finally, the network output is converted into probability values by the Softmax function, the normalized probabilities are sorted, the node with the greatest probability is taken as the prediction, and the classification result is output.
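For illustration only, the forward pass just described (convolution, ReLU, pooling, a fully connected layer and Softmax) can be sketched in plain Python/NumPy. All sizes and weights below are invented for the example and are unrelated to the network used in this paper:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution (cross-correlation, as in typical CNN layers)."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling."""
    h, w = x.shape
    return x[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# Toy forward pass: conv -> ReLU -> pool -> flatten -> fully connected -> softmax
rng = np.random.default_rng(0)
img = rng.random((8, 8))             # preprocessed input image (toy size)
kernel = rng.random((3, 3))          # one convolution kernel
fmap = np.maximum(conv2d(img, kernel), 0)   # ReLU activation
pooled = max_pool(fmap)              # 6x6 feature map -> 3x3
flat = pooled.ravel()
W = rng.random((10, flat.size))      # hypothetical 10-class FC layer
probs = softmax(W @ flat)            # normalized class probabilities
pred = int(np.argmax(probs))         # node with the greatest probability
```

A real network stacks several such conv/pool stages before the fully connected layer; the structure above only mirrors the single pipeline described in the text.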

Fig. 1. Basic structure of CNN

B. Image recognition process

Image recognition technology uses the commonness between objects to realize classification: images that share the same properties, such as color, shape, size and category, are grouped into the same class, so the classification results naturally differ under different classification standards. Figure 2 is the flow chart of image recognition using deep learning. Its working principle is as follows: first, preprocess the input image; then extract the features that reflect the essence of the image; finally, input them into the trained network model to obtain the recognition result.

Fig. 2. Flow chart of image recognition

III. SCHEME IMPLEMENTATION

This paper combines software and hardware to classify the collected images. The technical route is shown in Figure 3. Firstly, in the PC environment, the CNN model architecture is built on the CPU, then trained and tested to obtain good weight and bias values, which are extracted. At the same time, the CNN hardware platform is built on the FPGA side, and the weights and biases extracted on the PC side are applied to the FPGA to realize image classification on the hardware platform [7], which achieves a good acceleration effect.

Fig. 3. Technology roadmap

A. Basic framework of CNN accelerator

At present, neural networks are becoming larger and deeper. Because the storage resources of an FPGA are limited, the design of an FPGA-based CNN accelerator must rely on an external storage chip. This design combines an external DDR memory chip with the FPGA; the architecture is shown in Figure 4.

Fig. 4. Basic architecture of CNN accelerator

The main control module includes the DDR controller and the FIFOs that store intermediate results; it manages the input feature maps, network weights, biases and the intermediate results generated by the various operations. The DDR controller coordinates the communication between the FPGA and the external DDR memory chip: as the data transmission mechanism between them, it moves data from DDR to the FPGA or from the FPGA to DDR. The PE (processing element) computes on the input feature values and weights to obtain the final results.

The host program first loads the image data from the hard disk or a webcam, preprocesses the image data and the network weights, converts them into 8-bit fixed-point numbers, and sends them to the operation units of the FPGA accelerator to perform the CNN forward calculation. Finally, the result of the forward calculation is sent back to the host through the FPGA and displayed on the host screen.
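The host-side 8-bit fixed-point conversion can be illustrated with a small sketch. The Q-format below (6 fractional bits) is an assumption made for the example; the paper only states that the data are converted to 8-bit fixed-point numbers:

```python
import numpy as np

def to_fixed_q(x, frac_bits=6):
    """Quantize a float array to signed 8-bit fixed point (assumed Qm.n format,
    n = frac_bits). Values outside the representable range saturate."""
    scaled = np.round(np.asarray(x, dtype=np.float64) * (1 << frac_bits))
    return np.clip(scaled, -128, 127).astype(np.int8)

def from_fixed_q(q, frac_bits=6):
    """Dequantize back to float, e.g. to inspect the rounding error."""
    return q.astype(np.float64) / (1 << frac_bits)

w = np.array([-0.51, 0.003, 0.25, 1.2])   # example weight values
q = to_fixed_q(w)                          # int8 representation sent to the FPGA
err = np.abs(from_fixed_q(q) - w)          # bounded by half an LSB unless clipped
```

In a real deployment the scale (number of fractional bits) would be chosen per layer from the dynamic range of the weights and activations rather than fixed globally as here.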

B. Software and hardware function division

The data processing of a CNN application mainly consists of three parts: data preprocessing, classification or recognition using the network, and output of the recognition results. Among them, the image preprocessing and result output parts are closely tied to the front and back ends of the application and demand little computing power, whereas data classification and recognition are computation-heavy and the data throughput of the task is relatively high. Therefore, this part is built as a dedicated hardware accelerator from reconfigurable hardware resources to give full play to the high computing efficiency of the FPGA.

The functional module division [8] of the target system is shown in Figure 5. In order to ensure that, after division, the system software can be reprogrammed for the application while the hardware accelerator remains reconfigurable, the CNN hardware accelerator is connected to the processor through a high-performance on-chip bus. The accelerator is data-driven, and its inputs and outputs use the bus interface supported by the hardware platform.

Fig. 5. System segmentation

C. FPGA design and optimization

(1) Overall FPGA architecture

The FPGA architecture consists of multiple channels, each containing multiple processing units. The accelerator works as follows. The input image is transferred from the external DDR memory to the FPGA on-chip cache through the AXI bus, and the features involved in the calculation are stored in a RAM array. The model weights are loaded through the AXI bus into the RAM array that stores the weights. Because the model has a large number of parameters, the on-chip local cache stores only the weights participating in the current calculation; after the current calculation is completed, the feature values and weights of the convolution layer are updated. The input feature values and weights are convolved through the multiply-adder tree, and the convolution results pass through ReLU, pooling and other operations before being written into on-chip local memory. As the feature input of the next convolution layer, the cached feature map is convolved with the convolution kernel again until the final output is obtained. The output feature values produced by the repeated convolution operations are transmitted to the CPU through the AXI bus. Finally, the final classification result of the image is obtained by executing the Softmax function.

(2) Storage design

This paper designs independent data-fetch functions for the input data, weights, biases, convolution and pooling, optimizes the core data flow of the accelerator, and applies data-flow optimization directives. These directives not only perform the data acquisition operations in parallel, but also parallelize the data fetches of the convolution calculation unit and the cache unit, so as to improve the parallelism.

The core algorithm is partitioned according to the parallelism of each step and whether the calculation data are mutually independent. During CNN computing, only the memory-read kernel and the memory-write kernel can access global memory; the kernels exchange data through channels, which use FIFOs for direct data transmission between kernels. Figure 6 shows the channel diagram. RAM stores the feature values and weights, and channel transmission is adopted to reduce the transmission delay and improve the transmission efficiency.

Fig. 6. Channel implementation block diagram

After the channels are introduced, kernel 1, kernel 2 and kernel 3 no longer need to access global memory to obtain data: they read directly from the FIFO buffers. When a kernel executes, data are read and written through shift registers, and the parallelism between kernels is improved through programming. While the previous kernel executes, the data required by the next kernel's calculation is transmitted to the channel in sequence.

D. Data flow optimization

(1) Data cache optimization

In order to improve the throughput of the FPGA, data in off-chip memory must be transferred to the FPGA on-chip cache through the bus before the parallel kernel programs execute on the device. If a kernel program had to access off-chip memory every time it executed, much time would be wasted on data transmission and performance would degrade. To avoid this, data are kept in FPGA on-chip memory as far as possible, so that the kernels use the cached on-chip data first and unnecessary data transmission is avoided.

(2) Loop optimization

The convolution layer is implemented as a multi-level loop nest. In order to further improve the parallel execution performance of the program, the unroll

instruction is applied: it marks the kernel-program loop to be unrolled and tells the compiler how many times to unroll it (or lets the compiler determine the unroll factor). Loop unrolling makes full use of the parallelism between convolution kernels and of the computing resources, performing the computing operations in parallel.

(3) Global memory optimization

This design adopts channel-based data storage and uses channels to transfer data from one kernel to another inside the FPGA, avoiding repeated accesses to global memory during the calculation. This greatly improves data parallelism and reduces the data traffic and the delay of the whole system.

IV. EXPERIMENT AND RESULT ANALYSIS

A. Establishment of experimental environment

In order to verify the effectiveness of the design method, the acceleration scheme described in this paper is tested on a development board. The FPGA development tool is the Xilinx SDx IDE, the development board is the Huawei FX600, and the hardware development environment is Vivado (April 2017 release). The FPGA chip of the high-performance FX600 board is a Virtex UltraScale+ VU9P with four channels of DDR4-2400 SDRAM. The partial reconfiguration flow is extended to achieve high structural resource availability, and the chip can be used with the Xilinx DMA Subsystem for PCI Express with PCIe Gen3 x16 connectivity.

B. Test data set and network model

In order to verify the performance of the proposed image recognition algorithm, this paper compares its accuracy and average recognition time with those of the LeNet-5, AlexNet and VGG16 network models on the MNIST, CIFAR-10 and VOC2012 data sets. The experimental results are shown in TABLE I-III.

TABLE I. COMPARATIVE EXPERIMENT OF DIFFERENT NETWORK MODELS WHEN THE TEST DATA SET IS MNIST

network model                 Accuracy   Average recognition time (s)
LeNet-5                       0.989      1.2
AlexNet                       0.991      1.5
VGG16                         0.997      2.3
Improved CNN neural network   0.996      0.9

TABLE II. COMPARATIVE EXPERIMENT OF DIFFERENT NETWORK MODELS WHEN THE TEST DATA SET IS CIFAR-10

network model                 Accuracy   Average recognition time (s)
LeNet-5                       0.787      2.4
AlexNet                       0.810      3.3
VGG16                         0.832      5.5
Improved CNN neural network   0.828      1.3

TABLE III. COMPARATIVE EXPERIMENT OF DIFFERENT NETWORK MODELS WHEN THE TEST DATA SET IS VOC2012

network model                 Accuracy   Average recognition time (s)
LeNet-5                       0.757      2.7
AlexNet                       0.783      3.5
VGG16                         0.813      6.7
Improved CNN neural network   0.808      2.6

As can be seen from the tables above, the VGG16 network has the highest recognition accuracy on the three data sets, but because of its large number of parameters its average recognition time is also the longest. The average recognition times of the LeNet-5 and AlexNet networks are 2-3 s shorter than that of VGG16, but their accuracy is 3%-6% lower. The FPGA-based CNN image recognition algorithm proposed in this paper is more accurate than LeNet-5 and AlexNet on all data sets and on par with VGG16, while its average recognition time is reduced by about 50%. The proposed algorithm therefore improves both recognition accuracy and recognition time.

C. Comparison between the optimized algorithm and traditional recognition technology

Aiming at the task of video retrieval, this paper realizes the system with an ASIC circuit and carries out a comparative experiment against an image recognition system based on a DSP chip. The environment of the comparative experiment is set as follows: (1) the sample library contains 10000 256-byte samples; (2) one 256-byte datum is input externally; (3) the objective is to calculate the difference between the external data and each sample in the library; (4) the output includes the sample name and the difference; (5) the output must identify the sample with the smallest difference and report the corresponding difference value.

After setting up the above environment, the measured parameters of the DSP-based image recognition algorithm and of the algorithm proposed in this paper are shown in TABLE IV.
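As a plain software reference model for the comparison task defined by conditions (1)-(5) — a hypothetical sketch, not the DSP or ASIC implementation — the difference search can be written as:

```python
import numpy as np

def find_closest(sample_lib, names, query):
    """Return (name, difference) of the library sample closest to the query,
    where difference = sum of absolute byte-wise differences over 256 bytes."""
    diffs = np.abs(sample_lib.astype(np.int32) - query.astype(np.int32)).sum(axis=1)
    best = int(np.argmin(diffs))          # sample with the smallest difference
    return names[best], int(diffs[best])

# Hypothetical sample library: 10000 samples of 256 bytes each
rng = np.random.default_rng(1)
lib = rng.integers(0, 256, size=(10000, 256), dtype=np.uint8)
names = [f"sample_{i}" for i in range(10000)]
query = lib[1234].copy()                  # query identical to one stored sample
name, diff = find_closest(lib, names, query)
```

The hardware implementations accelerate exactly this inner accumulation: with 1024 ALUs the FPGA performs 2048 byte additions per cycle, which is where the speedup reported in TABLE IV comes from.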

TABLE IV. PERFORMANCE COMPARISON BETWEEN THE DSP-BASED IMAGE RECOGNITION ALGORITHM AND THE ALGORITHM PROPOSED IN THIS PAPER

Parameter                              DSP chip   Paper algorithm
clock frequency                        500        15
ALU quantity                           6          1024
Operation bit width (bit)              32         8
Single sample size (Byte)              256        256
Additions per cycle                    6          2048
Data processed per addition (Byte)     24         2048
Cycles to compare two samples          21.33      0.25
Time to compare two samples (μs)       42.66      16.5
Time to compare all samples (μs)       426.6      165

As can be seen from TABLE IV, the number of arithmetic logic units (ALUs) of the FPGA-based CNN recognition algorithm reaches 1024, significantly more than in the system based on the DSP chip. Because of the larger number of ALUs, the system performs 2048 addition operations per cycle and processes 2048 bytes of data. The time for the system to compare all samples is 165 μs, about 60% less than the DSP chip.

D. Resource occupation of each module

After placement and routing, the resource occupation of each module is listed in TABLE V; the convolution and memory-read operations occupy the most resources. Because CNN is multiplication-intensive, the number of DSP blocks consumed is the main measure of hardware resource utilization. TABLE VI shows the overall resource utilization. As can be seen from TABLE VI, the utilization of the board is relatively low and the system still has ample resources, so the program can be further optimized to raise the resource utilization.

TABLE V. RESOURCE OCCUPATION OF EACH MODULE

module        FF      LUT     DSP   BRAM
convolution   37874   48929   48    4
Irn           11967   11262   21    8
pooling       3500    7757    0     32
memread       23168   53354   22    112
memwrite      4780    7316    17    2

TABLE VI. RESOURCE CONSUMPTION

Resource type   Number used   Number available   Utilization rate (%)
FF              148698        2215298            6.71
LUT             118072        1059293            11.15
DSP             108           6836               1.60
BRAM            101           1926               5.24

V. CONCLUSIONS

This paper presents a CNN accelerator based on FPGA. The accelerator has an efficient pipelined kernel and strong scalability, and it suits a variety of network models. Through an in-depth understanding of the working principle, characteristics and structure of the FPGA, the network model structure is simplified, and the hardware implementation of image recognition based on a convolutional neural network is optimized by comprehensively applying software/hardware partitioning, loop optimization, pipeline optimization, data flow optimization and data caching. Compared with the CPU, the FPGA-based CNN accelerator loads faster; compared with other schemes, ours has higher throughput and computational performance. Compared with the LeNet-5, AlexNet and VGG16 network models, the recognition accuracy of the proposed algorithm is better than LeNet-5 and AlexNet and on par with VGG16, while the average recognition time is 50% shorter than all three. The hardware part uses the FPGA to realize the feature extraction process, and the image recognition process is implemented with an ASIC circuit. Compared with the image recognition system based on the DSP chip, the recognition time of the proposed method is reduced by about 60% on the premise of the same 1024 pictures.

REFERENCES

[1] LIU Fangyuan, WANG Shuihua, ZHANG Yudong. Review of deep confidence network model and application research[J]. Computer Engineering and Applications, 2018, 54(1): 11-18, 47. doi: 10.3778/j.issn.1002-8331.1711-0028.
[2] LI Guoliang, ZHOU Xuanhe, SUN Ji, et al. A survey of machine learning based database techniques[J]. Chinese Journal of Computers, 2020, 43(11): 2019-2049.
[3] LIANG Tian, AFZEL N. Software reliability prediction using recurrent neural network with Bayesian regularization[J]. International Journal of Neural Systems, 2004, 14(3): 165-174. doi: 10.1142/S0129065704001966.
[4] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]. Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, 2014: 2672-2680.
[5] SABOUR S, FROSST N, HINTON G E. Dynamic routing between capsules[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 3856-3866.
[6] REN Yuan, PAN Jun, LIU Jingjing, et al. Overview of artificial intelligence chip development[J]. Micro/Nano Electronics and Intelligent Manufacturing, 2019, 1(2): 20-34.
[7] WANG Y, ZHOU H Y, FENG H, et al. Network traffic classification method based on deep convolutional neural network[J]. Journal of Communications, 2018, 39(1): 14-23.
[8] LU Y, CHEN Y, LI T, et al. Construction method of embedded FPGA convolutional neural network for edge computing[J]. Computer Research and Development, 2018, 55(3): 55-562.
