
Volume 6, Issue 3, March – 2021 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

High Performance FPGA Based CNN Accelerator


Pratiksha B. Dange
Department of Electronics & Telecommunication Engineering
J D College of Engineering & Management
Nagpur, India

Dr. S.L. Haridas
Department of Electronics & Telecommunication Engineering
J D College of Engineering & Management
Nagpur, India

Abstract:- Over the years, convolutional neural networks have been used in many different applications, thanks to their ability to perform tasks using a reduced number of parameters compared to other deep learning methods. However, the power and memory constraints of embedded and portable platforms often conflict with the requirements of accuracy and latency. For these reasons, commercial hardware accelerators have become popular, and their designs are built around the trends of the most common convolutional network models. However, a field-programmable gate array represents an attractive alternative, because it offers the opportunity to use a hardware design tailored to a particular convolutional network model, with promising results in terms of latency and power consumption. In this article, we propose a complete field-programmable gate array hardware accelerator for convolutional neural network inference, designed for a keyword recognition system.

Keywords:- CNN, Accelerator, FPGA.

I. INTRODUCTION

As communication systems evolve and power levels increase, the spectrum is pushed up to higher frequencies to deal with the bulk of the information. With the introduction of 5G mobile technology, these bands are expected to reach as high as 3 to 27 GHz [10], which goes far beyond the standard radar spectra, especially with the K-band radar. With this comes the need for improved spectrum-sensing and signal-identification algorithms that allow sensors and radios to detect and identify spectrum users and participants. These algorithms have traditionally been the result of hand-crafted engineering. With the recent practice of using machine learning to process signals, neural networks have been shown to do well on the problem of radio signal recognition. Neural networks, and especially deep neural networks, have typically been run on graphics processing units (GPUs). Today's GPUs are very powerful and have many parallel compute features that are well suited for deep learning applications. Unfortunately, GPUs are power-hungry, which makes them unsuitable for power-limited applications. Radar works with a large amount of data and requires high throughput and low latency; if the radar is installed in a platform without external power resources, it should also use as little power as possible. For this reason, it may be necessary to run these algorithms on customized hardware to meet these requirements. Field-programmable gate arrays (FPGAs) have a good balance between cost, energy efficiency and computational resources, which makes them a good fit for this application.

In recent years we have had a huge data boom in many fields. To address the expansion of this "big data", the answer is found in Artificial Intelligence (AI). We can define it as a software or hardware application that thinks and solves problems as a human being can, with problems ranging from language translation to image classification to recognizing different faces and people. What we have achieved so far is narrow AI, which uses specific algorithms and techniques to solve specific problems.

Since neural networks are naturally parallel, they map well onto FPGAs (Field Programmable Gate Arrays). FPGA implementations have been shown to have significantly lower power consumption per operation than equivalent GPU (Graphics Processing Unit) implementations, which is a requirement for embedded systems. However, implementation is no small feat, because FPGA development is usually done in a hardware description language, e.g. VHDL.

Figure 1: Humans and AI

Within the AI field, we find Machine Learning (ML), which consists of using a large set of data and a number of classification algorithms to change the standard way we are accustomed to writing a program. With our standard programming method, we create algorithms that may be complex, but that we specify completely ourselves. The basic idea of machine learning is instead to take a large amount of data and let the system work out which parts of the data matter, improving its results with experience, so that we obtain a system that, without the whole algorithm being written by hand, is able to make decisions based on the available data. Some of these models follow mathematical methods we know, such as linear and polynomial functions; some of them are very good at predicting a particular type of behavior that is very difficult to capture in a hand-written algorithm, such as guessing the price of a house based on a historical series. Further, as an ML branch, we find a specific learning approach known as Deep Learning (DL).

Figure 1.2: AI world

The development of DL occurred in a manner similar to the study of neural networks in general. It is characterized by efforts to create a learning model with multiple levels of abstraction, where deeper levels take the outputs of previous levels into account, transforming and refining them further. This intuition of stacked levels of learning gives the whole field its name, and it is inspired by how the brain of a mammal processes information and learns, responding to external stimuli.

Figure 1.3: Mammal brain and Convolutional process

Over the years, convolutional neural networks (CNNs) have found application in many different fields, such as object detection [1, 2] and object recognition [3, 4]. Their deployment, however, depends on memory and energy consumption, which often conflicts with the requirements of latency and accuracy. In particular, with standard general-purpose solutions based on the use of a microcontroller, the limited available memory constrains network complexity, with a potential impact on accuracy [7].

In the same way, microcontroller-based systems feature the worst trade-off between power consumption and timing performance [8]. For this reason, commercial hardware accelerators for CNNs such as the Neural Compute Stick (NCS) [9], Neural Compute Stick 2 (NCS2) [9], and Google Coral [10] were produced. Such products feature optimized hardware architectures that make it possible to realize inference of CNN models with low latency and reduced power consumption. Standard communication protocols, such as Universal Serial Bus (USB) 3.0, are generally exploited for communication purposes. Nevertheless, since they were designed for the implementation of generic CNNs, their architectures are extremely flexible at the expense of the optimization of the single model.

For such a reason, hardware accelerators customized for a specific application might offer an interesting alternative for accelerating CNNs. In particular, field-programmable gate arrays (FPGAs) represent an interesting trade-off between cost, flexibility, and performance, especially for applications whose architectures have been changing too rapidly to rely on application-specific integrated circuits (ASICs) and whose production volumes might not be sufficient. At the same time, FPGAs offer high flexibility, which permits the implementation of different models with a high degree of parallelism and the possibility of customizing the architecture for a specific application.

II. LITERATURE SURVEY

An optimized block-floating-point (BFP) arithmetic is adopted in the accelerator of [1] for efficient inference of deep neural networks. Feature maps and model parameters are represented in 16-bit and 8-bit formats, respectively, in off-chip memory, which can reduce the memory and off-chip bandwidth requirements by 50% and 75% compared to a 32-bit floating-point counterpart. The proposed 8-bit BFP arithmetic, combined with optimized rounding and shifting-based quantization schemes, improves energy and hardware efficiency threefold. The FPGA-based CNN accelerator is implemented on the Xilinx VC709 evaluation board [1].
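
To make the BFP representation concrete, the following minimal sketch (ours, not code from [1]; the block contents, mantissa width, and rounding scheme are illustrative assumptions) quantizes a block of values to integer mantissas that share a single exponent:

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to block-floating-point:
    one shared exponent per block, integer mantissas."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros_like(block, dtype=np.int32), 0
    # Shared exponent chosen so the largest magnitude fits the mantissa range.
    shared_exp = int(np.ceil(np.log2(max_abs))) - (mantissa_bits - 1)
    # Round each value to an integer mantissa at the shared exponent
    # (clipped in case the largest value saturates by one step).
    mantissas = np.clip(np.round(block / 2.0 ** shared_exp),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1).astype(np.int32)
    return mantissas, shared_exp

def bfp_dequantize(mantissas, shared_exp):
    return mantissas.astype(np.float32) * 2.0 ** shared_exp

# Example: an 8-value block of weights quantized to 8-bit mantissas.
weights = np.array([0.31, -0.07, 0.55, 0.02, -0.48, 0.11, 0.26, -0.33])
m, e = bfp_quantize(weights, mantissa_bits=8)
print(m, e, bfp_dequantize(m, e))
```

Because all values in a block share one exponent, multiply-and-accumulate over the block reduces to plain integer arithmetic, which is what makes BFP attractive for FPGA DSP slices.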

In [2], the functional design of the hardware is introduced for an FFT processor built on the radix-2 decimation-in-frequency algorithm (R2DIF) and a distributed method that allows data to be shared efficiently through register exchanges. The design uses a modified coordinate rotation digital computer (m-CORDIC) algorithm together with Radix-2r recoding to replace the complex multiplications of the FFT. The m-CORDIC algorithm improves the computational structure, while Radix-2r allows a logarithmic reduction of the adder stages. The proposed design does not require the large memory blocks normally used to store the twiddle factors [2].
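
For reference, the sketch below is a plain software model of a radix-2 decimation-in-frequency FFT, the algorithm named above (our illustration only; the design of [2] additionally replaces the complex twiddle multiplications with m-CORDIC rotations and Radix-2r recoding, which this sketch does not model):

```python
import cmath

def fft_r2dif(x):
    """Radix-2 decimation-in-frequency FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return x
    half = n // 2
    # Butterfly stage: sums feed the even-frequency sub-FFT,
    # twiddled differences feed the odd-frequency sub-FFT.
    top = [x[i] + x[i + half] for i in range(half)]
    bottom = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
              for i in range(half)]
    even = fft_r2dif(top)      # X[0], X[2], X[4], ...
    odd = fft_r2dif(bottom)    # X[1], X[3], X[5], ...
    out = [0j] * n
    out[0::2] = even
    out[1::2] = odd
    return out

# Example: the 8-point FFT of an impulse is all ones.
print(fft_r2dif([1, 0, 0, 0, 0, 0, 0, 0]))
```
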
In [3], a CNN accelerator based on an FPGA is presented. The accelerator is designed to implement a flexible neural network efficiently, with memory optimization and the use of low-cost resources. The results show gains in performance and power compared to a Core i5 CPU and a GTX 960 GPU [3].

In [4], a hardware model for CNN inference is designed step by step in a hardware description language, including the CNN computing architecture, the multi-layer implementation, the weight-loading scheme, and the data interface [4].

In [5], it is shown that existing low-power register-transfer-level (RTL) strategies can serve as a low-power design scheme for accelerating a CNN-based object recognition system, in contrast to conventional strategies. Many of the most effective design strategies for CNN acceleration focus on High-Level Synthesis (HLS) features, such as memory bandwidth usage, network architecture, data reuse, and batch processing [5].

In [6], a CNN accelerator on the Xilinx ZYNQ 7100 hardware platform accelerates both standard convolution and depthwise separable convolution. Taking the MobileNet + SSD network design as an example, the accelerator maps the computation of the entire network onto the ZYNQ 7100 system-on-chip, using a data-streaming interface and a ping-pong buffer scheme [6].
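
To clarify the operation such an accelerator speeds up, here is a minimal software model of a depthwise separable convolution, the MobileNet building block mentioned above (our sketch; all shapes are illustrative, and padding and strides are omitted):

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    """Depthwise + pointwise convolution as used in MobileNet-style layers."""
    C, H, W = x.shape
    _, K, _ = dw.shape            # one KxK filter per input channel
    M, _ = pw.shape               # pointwise stage mixes C channels into M
    OH, OW = H - K + 1, W - K + 1
    mid = np.zeros((C, OH, OW), dtype=np.float32)
    # Depthwise: each channel is filtered independently.
    for c in range(C):
        for i in range(OH):
            for j in range(OW):
                mid[c, i, j] = np.sum(x[c, i:i+K, j:j+K] * dw[c])
    # Pointwise: a 1x1 convolution, i.e. a matrix multiply across channels.
    return np.tensordot(pw, mid, axes=([1], [0]))

x = np.random.rand(3, 8, 8).astype(np.float32)
dw = np.random.rand(3, 3, 3).astype(np.float32)   # depthwise 3x3 per channel
pw = np.random.rand(8, 3).astype(np.float32)      # pointwise 1x1, 3 -> 8
print(depthwise_separable_conv(x, dw, pw).shape)  # (8, 6, 6)
```

The point of the decomposition is that a KxK depthwise pass plus a 1x1 pointwise pass needs far fewer multiply-and-accumulate operations than a full KxK convolution over all channel pairs.
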
III. MOTIVATION

A major problem with the implementation of CNN-based models on FPGAs concerns the limitations in terms of hardware resources (combinational and sequential logic, Digital Signal Processors (DSPs), RAM blocks, etc.) of such devices. CNN algorithms are based on Multiply-and-Accumulate (MAC) operations, which require a large number of logic elements or DSPs. In addition, CNNs are characterized by a large number of parameters that determine figures of merit such as resource requirements, number of Giga-operations per second (Gops), Density Efficiency (DE), the time required to compute the CONV, FC and Softmax layers, and Power Efficiency (PEff). For these reasons, the hardware accelerator must be designed carefully, taking into account the trade-off between computation time and the available resources.
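
To see where the MAC operations come from, the following naive loop nest (a sketch of ours with arbitrary shapes, not the accelerator's datapath) computes one convolutional layer; every iteration of the innermost loops is exactly one multiply-and-accumulate:

```python
import numpy as np

def conv2d(x, w):
    """Naive 2-D convolution: every output pixel is a chain of MACs."""
    C, H, W = x.shape          # input channels, height, width
    M, _, K, _ = w.shape       # output channels, kernel size K x K
    out = np.zeros((M, H - K + 1, W - K + 1), dtype=np.float32)
    for m in range(M):                     # output channel
        for i in range(out.shape[1]):      # output row
            for j in range(out.shape[2]):  # output column
                acc = 0.0
                for c in range(C):         # input channel
                    for ki in range(K):
                        for kj in range(K):
                            # One multiply-and-accumulate (MAC) operation.
                            acc += x[c, i + ki, j + kj] * w[m, c, ki, kj]
                out[m, i, j] = acc
    return out

x = np.random.rand(3, 8, 8).astype(np.float32)    # 3-channel 8x8 input
w = np.random.rand(4, 3, 3, 3).astype(np.float32) # 4 filters, 3x3 kernels
print(conv2d(x, w).shape)  # (4, 6, 6)
```
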
IV. PROBLEM STATEMENT

The hardware accelerator performs the convolution function, the most computationally critical function of a ConvNet. To give an idea of the computational load of a ConvNet: in the AlexNet model, for example, 90% of the processing time is spent on convolution tasks. Moreover, the complexity of these networks is strongly linked to their depth, and one of the major problems is that this type of workload requires a lot of memory. We will try to implement the structures that perform this function in a systematic way, and to use strategies to reduce the number of accesses to the external DRAM attached to the FPGA during the computation of the convolution. The purpose of both is to exploit on-chip data reuse as much as possible.
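
As one illustration of such a strategy (our sketch, not the design proposed here; the tile size is an arbitrary parameter), output tiling keeps a small window of the feature map in on-chip buffers so that each external-DRAM word is fetched once per tile instead of once per MAC:

```python
# Sketch of loop tiling for convolution: process the output in tile x tile
# blocks so the corresponding input window and weights stay in on-chip
# buffers (BRAM), reducing the number of accesses to external DRAM.
import numpy as np

def conv2d_tiled(x, w, tile=4):
    C, H, W = x.shape
    M, _, K, _ = w.shape
    OH, OW = H - K + 1, W - K + 1
    out = np.zeros((M, OH, OW), dtype=np.float32)
    for ti in range(0, OH, tile):
        for tj in range(0, OW, tile):
            th, tw = min(tile, OH - ti), min(tile, OW - tj)
            # One burst read from "DRAM": the input window for this tile.
            x_tile = x[:, ti:ti + th + K - 1, tj:tj + tw + K - 1]
            # Compute the whole tile from the on-chip copy.
            for m in range(M):
                for i in range(th):
                    for j in range(tw):
                        out[m, ti + i, tj + j] = np.sum(
                            x_tile[:, i:i + K, j:j + K] * w[m])
    return out
```
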
V. HARDWARE ACCELERATORS

A hardware accelerator is a specialized hardware unit that performs a set of tasks with higher performance or better energy efficiency than a general-purpose CPU. Examples of common accelerators are GPUs, digital signal processors (DSPs) and fixed-function application-specific integrated circuits (ASICs) like video decoders [11]. To understand why accelerators have become so important, the history of the semiconductor industry has to be taken into account. The semiconductor industry has historically been driven by two scaling laws: Moore's Law and Dennard scaling. It is these two scaling trends that have made CMOS technology so popular in the computing industry. Moore's law states that the number of transistors that can economically be fit onto an integrated circuit doubles every two years. It is more than just shrinking transistors to yield better integration capabilities; it is fundamentally a cost-scaling law, and Moore was interested in shrinking transistor costs. Moore observed that the cost of transistors depends on two factors: one is the density of transistors that can be crammed onto a single chip, and the second is the yield of fabrication. To maintain Moore's law, two factors are critical:
1) transistor size - the smaller the better.
2) wafer size - the larger the better, since more chips can be produced from a fixed number of processing steps.

VI. DESIGN ARCHITECTURE

There exist two major architectures of hardware accelerators for neural networks: single computation unit accelerators and streaming accelerators. Single computation unit (SCU) accelerators have a similar construction to a RISC CPU that executes instructions with a fixed datapath. Instead of an ALU or FPU, the SCU accelerator has a dedicated matrix multiplier tailored for big matrices, or a systolic array of computation elements. When a network is to be accelerated on an SCU accelerator, instructions are generated for that specific network. The accelerator can then execute these instructions from memory. These types of accelerators are very flexible, since the only network-specific data that has to be stored are the instructions and the parameters. This enables networks with different architectures to be executed on the same accelerator. These accelerators suit systems that execute several different networks, since the accelerator can be shared between the tasks and the overhead to execute a new network is not that big. Even though SCU accelerators are often fixed for all types of networks, they can be tailored to specific networks with regard to the width of the datapath and the size of the matrix multiplier/systolic array, to better match the layer sizes in the network and yield higher resource utilization. This semi-tailoring of SCU accelerators tends to reach higher performance on CNNs with a uniform structure. This is because the utilization of the shared compute unit increases if the kernel sizes between layers are similar. It is also true if the size ratio between different layers is a power of two. When the deployment of neural networks onto these types of accelerators is automated, the automation framework generates the instructions and quantizes the weights as needed.

Figure 3: Basic structure of a single compute unit accelerator
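
The single-computation-unit idea described above can be caricatured in a few lines (a toy model of ours; the instruction names and layer set are invented for illustration): the datapath is fixed, and everything network-specific is reduced to an instruction stream plus parameters:

```python
import numpy as np

# Toy SCU model: the "hardware" is one fixed matrix-multiply unit plus an
# instruction interpreter; everything network-specific lives in the
# instruction stream and the parameter memory.
def matrix_unit(a, b):
    return a @ b  # stands in for the dedicated matrix multiplier

def run(instructions, params, x):
    for op, arg in instructions:
        if op == "MATMUL":        # fully connected layer
            x = matrix_unit(x, params[arg])
        elif op == "RELU":
            x = np.maximum(x, 0.0)
    return x

# "Compiling" a 2-layer MLP for the SCU: only instructions and weights
# change between networks; the datapath stays fixed.
params = {"w1": np.random.rand(8, 16), "w2": np.random.rand(16, 4)}
program = [("MATMUL", "w1"), ("RELU", None), ("MATMUL", "w2")]
print(run(program, params, np.random.rand(1, 8)).shape)  # (1, 4)
```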

Accelerators with the streaming architecture always tailor the hardware with respect to the target network. The layers are often directly implemented in hardware, and it is possible to get a very high level of parallelism and utilization. The intermediate results between layers can be stored in registers or memory, or directly pipelined into the next layer. This architecture is better suited for smaller networks, since a direct mapping can consume a lot of resources. One way to circumvent this resource constraint is to use a method called folding. With folding, one layer at a time is executed on the FPGA, and the FPGA is reconfigured between each layer. Since the FPGA needs to be reconfigured between each layer, batches of data have to be executed to yield a sensible throughput. Folding can generally yield a very high throughput, since the level of parallelism in each layer tends to be high, but the latency is often large, since large batches of data have to be executed. Figure 4 shows a block diagram of a simple streaming accelerator.

Figure 4: Basic structure of a streaming accelerator
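
The batch-versus-latency trade-off of folding can be sketched with a back-of-the-envelope throughput model (ours; the reconfiguration and per-sample times are made-up parameters, not measurements):

```python
# Toy throughput model for a folded streaming accelerator: the FPGA is
# reconfigured between layers, so large batches are needed to amortize it.
def folded_throughput(batch, layers=5, t_reconf=0.2, t_sample=0.001):
    """Samples per second for one pass over all layers (illustrative numbers)."""
    total = layers * (t_reconf + batch * t_sample)
    return batch / total

for batch in (1, 10, 100, 1000, 10000):
    print(f"batch={batch:6d}  throughput={folded_throughput(batch):8.1f} samples/s")
```

With these assumed numbers, throughput approaches its ceiling of 1/(layers x t_sample) only for large batches, while the latency of a batch grows proportionally, which is the behavior described above.
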
VII. CONCLUSION

The design of a CNN accelerator is very important for improving the performance of a system. The accelerator design reduces the load on the CPU/GPU and improves the efficiency of the system. A hardware accelerator is able to meet the demands in terms of speed and power through a careful analysis of the possible parallelization inside the CNN algorithm.

REFERENCES

[1]. Xiaocong Lian, Zhenyu Liu, Zhourui Song, Jiwu Dai, Wei Zhou, and Xiangyang Ji, "High-Performance FPGA-Based CNN Accelerator With Block-Floating-Point Arithmetic", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, August 2019.
[2]. M. S. Kavitha and P. Rangarajan, "An Efficient FPGA Architecture for Reconfigurable FFT Processor Incorporating an Integration of an Improved CORDIC and Radix-2r Algorithm", Circuits, Systems, and Signal Processing, https://doi.org/10.1007/s00034-020-01436-4.
[3]. Sheping Zhai, Cheng Qiu, Yuanyuan Yang, Jing Li and Yiming Cui, "Design of Convolutional Neural Network Based on FPGA", CISAT 2018.
[4]. Xiaofeng Chen, Jingyu Ji, Shaohui Mei, Yifan Zhang, Manli Han and Qian Du, "FPGA Based Implementation of Convolutional Neural Network for Hyperspectral Classification", IGARSS, ©2018 IEEE.
[5]. Heekyung Kim and Ken Choi, "Low Power FPGA-SoC Design Techniques for CNN-based Object Detection Accelerator", ©2019 IEEE.
[6]. Bing Liu, Danyin Zou, Lei Feng, Shou Feng, Ping Fu and Junbao Li, "An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution", Electronics 2019, 8, 281; doi:10.3390/electronics8030281.
[7]. Yuchi Tian et al., "DeepTest: Automated Testing of Deep-Neural-Network-Driven Autonomous Cars", in Proceedings of the 40th International Conference on Software Engineering, ICSE '18, 2018, pp. 303–314.
[8]. C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Feb. 2015, pp. 161–170.
[9]. J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 26–35.
[10]. K. Guo et al., "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018.
[11]. H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. 26th Int. Conf. Field Program. Logic Appl., Aug./Sep. 2016, pp. 1–9.
