Acceleration and Optimization of Artificial Intelligence CNN Image Recognition Based On FPGA
[email protected], [email protected]
Corresponding Author: Peipei Yan, Email: [email protected]
Abstract—Before CNN, image recognition methods mainly relied on hand-designed features, which can only represent low- and mid-level information in an image; it is difficult for them to extract the deep-level information of the image. CNN simulates the human brain by building a deep neural network to analyze, learn and interpret data. It has strong expressive and generalization ability and can better represent the deep-level information of images. At present, CNN is widely used in many scenarios, including image classification, speech recognition, video analysis, document analysis and so on. Most existing artificial intelligence image recognition systems are implemented in software and often accelerated by GPU. However, the GPU has high power consumption and is not well suited to CNN inference, so it cannot fully exploit the parallel computing potential of CNN. To solve these problems, this paper studies the application of FPGA-based acceleration and optimization to CNN image recognition, and designs and optimizes the CNN forward model on an Intel FPGA board. The experimental results show that the FPGA has low power consumption: the CPU consumes 2.1 times as much power, and the GPU 6.5 times. The average recognition time of the algorithm is 50% shorter than that of LeNet-5, AlexNet and VGG16, and the recognition time for 10000 sample pictures is 165 μs, compared with the 426.6 μs required by the DSP-based system. Compared with methods proposed in the recent literature in related fields, the proposed method has higher throughput and computational performance.

Keywords— CNN; FPGA; Artificial intelligence; Image recognition; Data flow optimization

I. INTRODUCTION

At present, the following deep learning models are mainly used in artificial intelligence image recognition: deep belief network (DBN) [1], convolutional neural network (CNN) [2], recurrent neural network (RNN) [3], generative adversarial network (GAN) [4] and capsule network (CapsNet) [5]. As one of the common technologies in the field of computer vision, CNN has been used in many fields, such as image recognition, speech recognition, data mining, intelligent robots and so on.

Although CNN can meet the functional needs of image recognition, at present image recognition is mostly implemented in software, which cannot exploit the parallel-processing advantage of neural networks, and the limited computing power of the CPU or GPU restricts the complexity of the network structure [6]. To improve the computational performance of CNN, this paper proposes an FPGA-based CNN image recognition acceleration and optimization algorithm, which realizes CNN acceleration by introducing an FPGA-based heterogeneous development environment. The experimental results show that, compared with CPU, GPU, ASIC and DSP implementations, the recognition accuracy of the FPGA-based CNN image recognition algorithm is higher than that of LeNet-5 and AlexNet on different data sets and the same as that of VGG16, while the average recognition time is 50% shorter than that of all three network models. The FPGA has low power consumption: the CPU consumes 2.1 times as much power, and the GPU 6.5 times. Compared with an image recognition system based on a DSP chip, the recognition time can be reduced by 60% on the same data set of 1024 samples.

II. IMAGE RECOGNITION DESIGN SCHEME

A. Image recognition architecture

Image preprocessing in traditional image recognition technology enhances the image features, segments the image according to feature differences, separates the target to be recognized from the whole image, and then extracts features from the separated target.

CNN is a deep learning neural network model, which calculates output values through forward propagation and adjusts connection weights through back propagation. Sparse connectivity and weight sharing are its two most important characteristics. The neurons in each layer are no longer fully connected to those in adjacent layers but only partially connected; that is, the receptive field of a neuron is affected only by some neurons in the previous layer. Such a network structure reduces the complexity of the model. Figure 1 shows the basic structure of CNN.
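The sparse-connectivity and weight-sharing properties described above can be illustrated with a minimal convolution sketch (plain Python, not the paper's FPGA implementation; the image and kernel values are invented for illustration):

```python
# Minimal sketch of a single 2D convolution. The 3x3 kernel is shared
# across every output position (weight sharing), and each output neuron
# reads only a 3x3 patch of the input (sparse, local connectivity)
# rather than the whole image.

def conv2d(image, kernel):
    """Valid-mode 2D convolution over a list-of-lists image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Each output value depends only on a local kh x kw patch.
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
# One shared 3x3 kernel: 9 weights cover the whole image, instead of the
# 16 x 4 = 64 weights a fully connected mapping to 4 outputs would need.
kernel = [[0, 1, 0],
          [1, -4, 1],
          [0, 1, 0]]
print(conv2d(image, kernel))  # -> [[0.0, -6.0], [-11.0, -19.0]]
```

Because the same 9 weights slide over the whole image, the parameter count is independent of image size, which is what makes the model complexity reduction described above possible.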
1947
Authorized licensed use limited to: National University Fast. Downloaded on September 01,2023 at 06:42:16 UTC from IEEE Xplore. Restrictions apply.
B. Software and hardware function division

The data processing flow of a CNN application mainly consists of three parts: data preprocessing, classification or recognition using the network, and output of the recognition results. Among them, image preprocessing and recognition-result output are closely tied to the front and back ends of the application and demand little computing power, while data classification and recognition are computation-heavy and the data throughput of the task is relatively high. Therefore, this part is built as a dedicated hardware accelerator from reconfigurable hardware resources, giving full play to the high computing efficiency of the FPGA.

The functional module division [8] of the target system is shown in Figure 5. To ensure that, after division, the system software can be reprogrammed according to the application and that the hardware accelerator remains reconfigurable, the CNN hardware accelerator is connected to the processor through a high-performance on-chip bus. The accelerator is designed in a data-driven form, and its input and output are set as the bus interface supported by the hardware platform.

…the final classification result of the image is obtained by executing the Softmax function.

(2) Storage design

This paper designs independent data-fetch functions for input data, weights, biases, convolution and pooling, and optimizes the core data flow of the accelerator with data flow optimization instructions. These instructions not only perform the data-fetch operations in parallel, but also parallelize data fetching between the convolution calculation unit and the cache unit, improving parallelism.

The core algorithm is partitioned according to the parallelism of each step and whether the calculated data are independent of each other. During CNN computation, only the memory-read kernel and the memory-write kernel can access global memory; the kernels exchange data through channels. A channel uses a FIFO for direct data transmission between kernels; Figure 6 shows the channel diagram. RAM stores feature values and weights. Channel transmission reduces transmission delay and improves transmission efficiency.
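A software sketch of this channel scheme (an assumption-laden stand-in for the paper's kernels, using Python threads and bounded queues in place of hardware FIFO channels):

```python
# Sketch of the dataflow described above: only a memory-read "kernel" and a
# memory-write "kernel" touch global memory; intermediate results move
# between kernels through bounded FIFO channels instead of global-memory
# round trips. The squaring step is a placeholder for convolution work.
import queue
import threading

GLOBAL_MEMORY = list(range(8))   # stand-in for off-chip global memory
results = []

ch_in = queue.Queue(maxsize=4)   # FIFO channel: reader -> compute
ch_out = queue.Queue(maxsize=4)  # FIFO channel: compute -> writer
DONE = object()                  # end-of-stream marker

def memory_read_kernel():
    for value in GLOBAL_MEMORY:  # the only kernel that reads global memory
        ch_in.put(value)
    ch_in.put(DONE)

def compute_kernel():
    while True:
        value = ch_in.get()
        if value is DONE:
            ch_out.put(DONE)
            break
        ch_out.put(value * value)

def memory_write_kernel():
    while True:
        value = ch_out.get()
        if value is DONE:
            break
        results.append(value)    # the only kernel that writes results back

threads = [threading.Thread(target=f) for f in
           (memory_read_kernel, compute_kernel, memory_write_kernel)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # -> [0, 1, 4, 9, 16, 25, 36, 49]
```

The bounded queue capacity mirrors the FIFO depth of a hardware channel: a full channel stalls the producer kernel, which is exactly the backpressure behavior that lets the kernels run concurrently without a shared-memory rendezvous.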
TABLE II. COMPARATIVE EXPERIMENT OF DIFFERENT NETWORK MODELS WHEN THE TEST DATA SET IS CIFAR-10

network model                Accuracy   Average recognition time(s)
LeNet-5                      0.787      2.4
AlexNet                      0.810      3.3
VGG16                        0.832      5.5
Improved CNN neural network  0.828      1.3

…the loop unrolling instruction is used in the host program to specify which kernel loop is to be unrolled and to tell the compiler how many times to unroll it (or the compiler determines the unroll factor itself). Loop unrolling makes full use of the parallelism between convolution kernels and of the computing resources, performing computing operations in parallel.
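What the unroll directive asks the compiler to do can be sketched by unrolling a loop by hand (illustrative only; in the real flow the compiler performs this transformation and the FPGA executes the independent operations in parallel):

```python
# Hand-unrolled sketch of loop unrolling with factor 4. The four
# multiply-adds per iteration use independent accumulators, so they have
# no mutual dependencies and can map to four parallel hardware units.

def dot_rolled(a, b):
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc

def dot_unrolled4(a, b):
    """Same result as dot_rolled; length must be a multiple of 4."""
    assert len(a) % 4 == 0
    acc0 = acc1 = acc2 = acc3 = 0  # four independent accumulators
    for i in range(0, len(a), 4):
        acc0 += a[i]     * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
    return acc0 + acc1 + acc2 + acc3

a = list(range(8))
b = list(range(8, 16))
print(dot_rolled(a, b), dot_unrolled4(a, b))  # -> 364 364
```

The unrolled loop runs a quarter as many iterations, trading hardware area (more multipliers and adders) for cycles, which is the resource-versus-latency trade that the unroll factor controls.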
(3) Global memory optimization

This design adopts channel-based data storage and uses channels to transfer data from one kernel to another inside the FPGA, avoiding repeated accesses to global memory during calculation. This greatly improves data parallelism and throughput and reduces the delay of the whole system.

TABLE III. COMPARATIVE EXPERIMENT OF DIFFERENT NETWORK MODELS WHEN THE TEST DATA SET IS VOC2012

network model                Accuracy   Average recognition time(s)
LeNet-5                      0.757      2.7
AlexNet                      0.783      3.5
B. Test data set and network model

To verify the performance of the improved CNN image recognition algorithm, this paper compares its accuracy and average recognition time with the LeNet-5, AlexNet and VGG16 network models on the MNIST, CIFAR-10 and VOC2012 data sets respectively. The experimental results are shown in TABLE Ⅰ-Ⅲ.

TABLE I. COMPARATIVE EXPERIMENT OF DIFFERENT NETWORK MODELS WHEN THE TEST DATA SET IS MNIST

network model                Accuracy   Average recognition time(s)
LeNet-5                      0.989      1.2
AlexNet                      0.991      1.5
VGG16                        0.997      2.3
Improved CNN neural network  0.996      0.9

C. Comparison between the optimized algorithm and traditional recognition technology

Aiming at the task of video retrieval, this paper uses an ASIC circuit to realize the system, and a comparative experiment is carried out against an image recognition system based on a DSP chip. The environment of the comparative experiment is set as follows: (1) the sample library contains 10000 samples of 256 bytes each; (2) a 256-byte datum is input externally; (3) the objective is to calculate the difference between the external data and each sample in the sample library; (4) the output includes the sample name and the difference value; (5) the output is required to identify the sample with the smallest difference value and the corresponding difference.

After setting up the above environment, the calculated values of various parameters for the DSP-based image recognition algorithm and the algorithm proposed in this paper are shown in TABLE Ⅳ.
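The comparison task set up above can be sketched as follows (the difference metric, sample names and reduced library size are assumptions made for illustration; the paper does not give the exact difference formula):

```python
# Sketch of the nearest-sample search: each library sample and the external
# input are 256-byte vectors; the "difference" is taken here as the sum of
# absolute byte differences, and the sample with the smallest difference
# (plus that difference value) is the output.
import random

SAMPLE_BYTES = 256
random.seed(7)

# Hypothetical sample library: name -> 256-byte vector
# (100 samples here; the paper's experiment uses 10000).
library = {f"sample_{i:05d}":
           bytes(random.randrange(256) for _ in range(SAMPLE_BYTES))
           for i in range(100)}

def difference(a, b):
    """Sum of absolute differences between two equal-length byte strings."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_match(query, library):
    """Return (name, difference) of the library sample closest to query."""
    return min(((name, difference(query, sample))
                for name, sample in library.items()),
               key=lambda pair: pair[1])

query = library["sample_00042"]  # query identical to one stored sample
name, diff = best_match(query, library)
print(name, diff)                # the exact match has difference 0
```

Every sample is scored independently of the others, which is why the workload parallelizes so well onto the 1024 ALUs reported for the FPGA design.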
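The derived rows of TABLE IV below can be cross-checked arithmetically. The sketch assumes a two-sample comparison processes 2 × 256 bytes and that the per-two-sample times are in nanoseconds (the unit consistent with the 10000-sample totals of 426.6 μs and 165 μs); under those assumptions the recomputed FPGA figures come out slightly above the reported 16.5 ns and 165 μs, i.e. the effective clock is presumably a little over 15 MHz.

```python
# Cross-check of TABLE IV's derived rows (a sketch under the stated
# assumptions: a "two sample" comparison moves 2 x 256 bytes, and the
# two-sample times are nanoseconds).

def comparison_stats(clock_mhz, bytes_per_cycle, sample_bytes=256,
                     n_samples=10000):
    cycles_two = 2 * sample_bytes / bytes_per_cycle  # cycles per 2-sample compare
    time_two_ns = cycles_two / clock_mhz * 1e3       # one cycle at f MHz = 1e3/f ns
    total_us = time_two_ns * n_samples / 1e3         # all-sample time in microseconds
    return cycles_two, time_two_ns, total_us

dsp = comparison_stats(clock_mhz=500, bytes_per_cycle=24)    # DSP chip column
fpga = comparison_stats(clock_mhz=15, bytes_per_cycle=2048)  # paper-algorithm column
print([round(v, 2) for v in dsp])   # -> [21.33, 42.67, 426.67]  (table: 21.33 / 42.66 / 426.6)
print([round(v, 2) for v in fpga])  # -> [0.25, 16.67, 166.67]   (table: 0.25 / 16.5 / 165)
```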
TABLE IV. PERFORMANCE COMPARISON BETWEEN IMAGE RECOGNITION ALGORITHM BASED ON DSP AND ALGORITHM PROPOSED IN THIS PAPER

Parameter                                  DSP chip   Paper algorithm
Clock frequency(MHz)                       500        15
ALU quantity                               6          1024
Operation bit width(bit)                   32         8
Single sample size(Byte)                   256        256
Additions performed per cycle              6          2048
Data processed per addition cycle(Byte)    24         2048
Cycles to compare two samples              21.33      0.25
Time to compare two samples(ns)            42.66      16.5
Time to compare with all samples(μs)       426.6      165

As can be seen from TABLE Ⅳ, the number of arithmetic logic units (ALUs) of the FPGA-based CNN recognition algorithm reaches 1024, significantly more than in the DSP-based system. Owing to the larger number of ALUs, the system performs 2048 addition operations per cycle and processes 2048 bytes of data per cycle. The time for the system to compare all samples is 165 μs, about 60% less than the DSP chip.

D. Resource occupation of each module

After full place and route, the resource occupation of each module is listed in TABLE Ⅴ, in which the convolution operation and the memory-read operation occupy the most resources. Because CNN is multiplication-intensive, the number of DSP blocks consumed is the main factor in evaluating hardware resource utilization. TABLE Ⅵ shows the overall resource utilization. It can be seen from TABLE Ⅵ that the utilization of the board is relatively low and there are still ample resources in the system; the program can be further optimized to improve resource utilization.

TABLE V. RESOURCE OCCUPATION OF EACH MODULE

module       FF      LUT     DSP   BRAM
Convolution  37874   48929   48    4
LRN          11967   11262   21    8
Pooling      3500    7757    0     32
Memread      23168   53354   22    112
Memwrite     4780    7316    17    2

TABLE VI. RESOURCE CONSUMPTION

Resource type   Number used   Number available   Utilization rate(%)
FF              148698        2215298            6.71
LUT             118072        1059293            11.15
DSP             108           6836               1.60
BRAM            101           1926               5.24

V. CONCLUSIONS

This paper presents a CNN accelerator based on FPGA. The accelerator has an efficient pipelined kernel and strong scalability, and suits a variety of network models. Through an in-depth understanding of the working principle, characteristics and structure of the FPGA, the network model structure is simplified, and the hardware implementation of image recognition based on a convolutional neural network is optimized by comprehensively applying software/hardware division, loop optimization, pipeline optimization, data flow optimization and data caching. Compared with a CPU, the FPGA-based CNN accelerator has a faster loading speed; compared with other schemes, our scheme has higher throughput and computational performance. Compared with the LeNet-5, AlexNet and VGG16 network models, the recognition accuracy of the proposed algorithm is better than LeNet-5 and AlexNet and the same as VGG16, while the average recognition time is 50% shorter than all three; the hardware part uses the FPGA to realize feature extraction, and an ASIC circuit to realize the image recognition function. Compared with the image recognition system based on a DSP chip, the recognition time of the proposed method can be reduced by about 60% on the premise of 1024 pictures.

REFERENCES

[1] LIU Fangyuan, WANG Shuihua, and ZHANG Yudong. Review of deep confidence network model and application research[J]. Computer Engineering and Applications, 2018, 54(1): 11-18, 47. doi: 10.3778/j.issn.1002-8331.1711-0028.
[2] LI Guoliang, ZHOU Xuanhe, SUN Ji, et al. A survey of machine learning based database techniques[J]. Chinese Journal of Computers, 2020, 43(11): 2019-2049.
[3] LIANG Tian and AFZEL N. Software reliability prediction using recurrent neural network with Bayesian regularization[J]. International Journal of Neural Systems, 2004, 14(3): 165-174. doi: 10.1142/S0129065704001966.
[4] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]. Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, 2014: 2672-2680.
[5] SABOUR S, FROSST N, and HINTON G E. Dynamic routing between capsules[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 3856-3866.
[6] REN Yuan, PAN Jun, LIU Jingjing, et al. Overview of artificial intelligence chip development[J]. Micro/Nano Electronics and Intelligent Manufacturing, 2019, 1(2): 20-34.
[7] WANG Y, ZHOU H Y, FENG H, et al. Network traffic classification method based on deep convolutional neural network[J]. Journal of Communications, 2018, 39(1): 14-23.
[8] LU Y, CHEN Y, LI T, et al. Construction method of embedded FPGA convolutional neural network for edge computing[J]. Computer Research and Development, 2018, 55(3): 551-562.