
Accelerated Deep Learning Inference from Constrained Embedded Devices

Bhargav Bhat 1, Abhay A Deshpande 2
1 Department of Electronics and Communication Engineering, RV College of Engineering, Bengaluru, India
2 Department of Electronics and Communication Engineering, RV College of Engineering, Bengaluru, India
[email protected]; [email protected]

Abstract— Hardware looping is a feature of some processor instruction sets in which the hardware can repeat the body of a loop automatically, rather than requiring software instructions that consume cycles (and therefore time) to do so. Loop unrolling is a loop transformation technique that attempts to improve a program's execution speed at the expense of its binary size, an approach known as the space–time tradeoff. A convolutional neural network is created with simple loops, with hardware looping, with loop unrolling, and with both hardware looping and loop unrolling, and a comparison is made to evaluate the effectiveness of hardware looping and loop unrolling. Hardware loops alone reduce the cycle count, while the combination of hardware loops and dot product instructions reduces the clock cycle count further. The CNN is simulated in Xilinx Vivado 2021.1 targeting a Zynq-7000 FPGA.

Index Terms— Convolutional Neural Network, Deep Learning, FPGA, Hardware Looping, Loop Unrolling, Vivado.

Manuscript received Month date, 2020; revised XX XX, 2020; accepted XX XX, 2020. Corresponding author: Name (email: XXXXXXX).

I. INTRODUCTION

Deep learning algorithms have seen success in a wide variety of applications, for example machine translation, image and speech recognition, and self-driving vehicles. More recently, these algorithms have gained traction in the embedded systems space. Most embedded systems are built around inexpensive microcontrollers with limited memory capacity and are therefore commonly seen as incapable of running deep learning algorithms. Nevertheless, we consider that advances in the compression of neural networks and in neural network architecture, combined with an improved instruction set architecture, could make microcontroller-grade processors suitable for specific low-power deep learning applications. The complexity of typical networks is far too great for memory-constrained microcontrollers whose memory sizes are specified in kilobytes. Some embedded system designers work around the problem of limited resources by processing neural networks in the cloud. However, this arrangement is restricted to regions with Internet access, and cloud processing has other drawbacks as well, for example privacy concerns, security, high latency, communication power consumption, and reliability.

Deep learning algorithms perform massive arithmetic computations. To accelerate these computations at a reasonable hardware cost, we can use an instruction set extension comprising two instruction types: hardware loops and dot product instructions (a minimal loop sketch follows the contribution list below). The primary contributions of this paper are as follows:

• We propose an approach for computing neural network functions that is optimized for the use of hardware loops and dot product instructions.

• The effectiveness of hardware loops and dot product instructions for performing deep learning functions is evaluated, and

• The Lenet-5 neural network is implemented on a Zynq-7000 FPGA.
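To make the targeted optimizations concrete, the following minimal C sketch (our own illustration, not code from any of the referenced processors) shows the kind of multiply-accumulate inner loop that dominates convolutional and fully connected layers. In the plain version, the loop branch and index update cost cycles on every iteration; a hardware loop removes that overhead, unrolling trades binary size for fewer branches, and a dot product instruction would fuse each multiply-accumulate pair into a single operation. The vector length and data types are assumptions chosen only for illustration.

    #include <stdint.h>

    #define N 64  /* illustrative vector length, divisible by 4 */

    /* Plain loop: one branch plus index update per element. */
    int32_t dot_simple(const int8_t *w, const int8_t *x)
    {
        int32_t acc = 0;
        for (int i = 0; i < N; i++)
            acc += (int32_t)w[i] * x[i];
        return acc;
    }

    /* Unrolled by four: fewer branches at the cost of a larger binary.
     * On a core with hardware-loop and dot-product extensions, the four
     * multiply-accumulates in the body would map to one dot-product
     * instruction and the loop bookkeeping would be handled in hardware. */
    int32_t dot_unrolled(const int8_t *w, const int8_t *x)
    {
        int32_t acc = 0;
        for (int i = 0; i < N; i += 4) {
            acc += (int32_t)w[i]     * x[i];
            acc += (int32_t)w[i + 1] * x[i + 1];
            acc += (int32_t)w[i + 2] * x[i + 2];
            acc += (int32_t)w[i + 3] * x[i + 3];
        }
        return acc;
    }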
There have been different approaches to accelerating deep learning functions. These approaches can be sorted into two sets. The first set contains approaches that attempt to reduce the size of the neural networks, in other words, to optimize the software. Approaches in the second set try to optimize the hardware on which neural networks run. As the work presented here deals with hardware optimization, we focus on the related approaches for hardware optimization and only briefly note the progress in software optimization. A simple instruction set was proposed to evaluate the effectiveness of hardware loops and dot product instructions with fully optimized assembly functions for fully connected and convolutional neural networks [1]. A custom instruction set architecture was used for the efficient realization of artificial neural networks and can be parameterized to an arbitrary fixed-point format [2]. A CNN-specific instruction set architecture was used which deploys the instruction parameters with high flexibility and embeds parallel computation and data reuse parameters in the instructions [3]. Instruction set extensions and micro-architectural enhancements were used to increase computational density and to limit the pressure on the shared memory hierarchy in RISC processors [4]. A recurrent neural network was combined with a convolutional neural network, and the deep features of the image were learned in parallel using the convolutional neural network and the recurrent neural network [5]. The framework of the Complex Network Classifier (CNC) is used, integrating network embedding and a convolutional neural network to tackle the problem of network classification.
By training the classifier on synthetic complex network data, they showed that CNC can not only classify networks with high accuracy and robustness but can also extract the features of the networks automatically [6]. A value prediction method is used to exploit the spatial correlation of zero-valued activations within the CNN output feature maps, thereby saving convolution operations [7]. The impact of packet loss on data integrity is reduced by taking advantage of the deep network's ability to understand neural data and by using a data repair method based on convolutional neural networks [8]. An instruction set simulation process is used for a soft-core Reduced Instruction Set Computer (RISC) processor, providing a reliable simulation platform for creating customizable instruction sets for an Application Specific Instruction Set Processor (ASIP) [9]. A RISC-V ISA compatible processor is presented, and the effect of the instruction set on the pipeline/micro-architecture design is analyzed in terms of instruction encoding, functionality of instructions, instruction types, decoder logic complexity, data hazard detection, register file organization and access, functioning of the pipeline, effect of branch instructions, control flow, data memory access, operating modes, and execution unit hardware resources [10].

Deep learning algorithms are used increasingly in smart


applications. Some of them also run in Internet of Things
(IoT) devices. IoT Analytics reports that, by 2025, the
number of IoT devices will rise to 22 billion. The
motivation for our work stems from the fact that the rise
of the IoT will increase the need for low-cost devices
built around a single microcontroller capable of
supporting deep learning algorithms. Accelerating deep
learning inference in constrained embedded devices, Fig 1: Some Test pictures from MNIST dataset
presented in this paper, is our attempt in this direction. D. Building Blocks of Convolutional Neural Network
The remainder of this paper is organized as follows.  Convolutional Layer is the core center structure
Section II presents the related work in hardware and of CNN. In a CNN, the information is a tensor
software enhancements aimed at accelerating the neural with a shape: (number of data sources) x (input
network computation. Section III shows the methodology height) x (input width) x (input channels). In the
concerned with the project. Section IV shows the results wake of going through a convolutional layer, the
and the subsequent discussions regarding the obtained picture becomes disconnected to an element
results. Section V presents the conclusion and plans for map, likewise called an actuation map, with
additional work for the future.
II. BACKGROUND

A. Convolutional Neural Networks

A convolutional neural network (CNN, or ConvNet) is a class of artificial neural network most commonly applied to analyzing visual imagery. A CNN is a deep learning algorithm that can take in an input image, assign importance to different aspects or objects in the image, and differentiate one from the other. The pre-processing needed in a CNN is much lower than for other classification algorithms. While in primitive methods filters are hand-engineered, CNNs, with enough training, can learn these filters and characteristics. CNNs have many applications, including facial recognition, document analysis, historic and environmental collections, understanding climate, obtaining a holistic view of what a human sees, advertising, and other fields.

B. Lenet

Lenet (also called Lenet-5) is a classic convolutional neural network which uses convolutions, pooling, and fully connected layers. Lenet is used for handwritten digit recognition with the MNIST dataset.

C. MNIST Dataset

MNIST is the acronym for the Modified National Institute of Standards and Technology database. The MNIST database contains 60,000 training images and 10,000 testing images. MNIST images have dimensions of 28 x 28. To bring the MNIST images up to the dimensions required by the input layer, the 28 x 28 images are padded. Some of the test pictures from the MNIST test dataset are shown in Fig 1.

Fig 1: Some test pictures from the MNIST dataset
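As an illustration of the padding step described above, the sketch below zero-pads a 28 x 28 MNIST image into a 32 x 32 input buffer by adding a two-pixel border on every side. This is a minimal sketch under the assumption of zero padding centred in the buffer; the padding values and border placement used in a particular implementation may differ.

    #include <stdint.h>
    #include <string.h>

    #define MNIST_DIM 28
    #define INPUT_DIM 32
    #define PAD       2   /* (32 - 28) / 2 pixels on each side */

    /* Copy a 28x28 image into the centre of a zero-initialised 32x32 buffer. */
    void pad_mnist_image(const uint8_t src[MNIST_DIM][MNIST_DIM],
                         uint8_t dst[INPUT_DIM][INPUT_DIM])
    {
        memset(dst, 0, INPUT_DIM * INPUT_DIM);
        for (int r = 0; r < MNIST_DIM; r++)
            for (int c = 0; c < MNIST_DIM; c++)
                dst[r + PAD][c + PAD] = src[r][c];
    }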
D. Building Blocks of Convolutional Neural Network

• Convolutional Layer: the convolutional layer is the core building block of a CNN. In a CNN, the input is a tensor with a shape of (number of inputs) x (input height) x (input width) x (input channels). After passing through a convolutional layer, the image is abstracted to a feature map, also called an activation map, with a shape of (number of inputs) x (feature map height) x (feature map width) x (feature map channels). Convolutional layers convolve the input and pass the result to the following layer (a minimal convolution and pooling sketch follows this list).

• Pooling Layer: convolutional networks may include local and/or global pooling layers alongside traditional convolutional layers. Pooling layers reduce the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the following layer. Local pooling combines small clusters; tiling sizes such as 2 x 2 are commonly used. Global pooling acts on all the neurons of the feature map. There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map, while average pooling takes the average value.

• Fully Connected Layers: fully connected layers connect every neuron in one layer to every neuron in another layer. This is equivalent to a conventional multi-layer perceptron (MLP) neural network. The flattened feature map goes through a fully connected layer to classify the images.
• Receptive Field: in neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's receptive field. Typically the area is a square (e.g., 5 by 5 neurons), whereas in a fully connected layer the receptive field is the entire previous layer. Accordingly, in each convolutional layer, each neuron takes input from a bigger region of the input than previous layers do. This is due to applying the convolution over and over, which takes into account the value of a pixel as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers.

• Weights: every neuron in a neural network computes an output value by applying a particular function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights.

• The vector of weights and the bias are called a filter and represent specific features of the input (e.g., a specific shape). A distinctive feature of CNNs is that many neurons can share the same filter. This decreases the memory footprint because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and vector of weights.
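To tie these building blocks to the loop structure they imply, the following minimal C sketch (an illustration only, not the code used for the FPGA design) computes one output channel of a convolution with a 5 x 5 filter and stride 1, followed by 2 x 2 average pooling with stride 2, the same layer shapes that Lenet-5 uses. Nested loops like these are exactly where hardware loops and dot product instructions save cycles. The single input channel, float data type, and the omission of the activation function are simplifying assumptions.

    #define IN_DIM   32  /* input feature map height/width (single channel) */
    #define K        5   /* filter size */
    #define OUT_DIM  (IN_DIM - K + 1)  /* 28 with stride 1, no padding */
    #define POOL_DIM (OUT_DIM / 2)     /* 14 after 2x2 average pooling */

    /* 5x5 convolution, stride 1: each output pixel is a 25-element dot product. */
    void conv5x5(const float in[IN_DIM][IN_DIM],
                 const float w[K][K], float bias,
                 float out[OUT_DIM][OUT_DIM])
    {
        for (int r = 0; r < OUT_DIM; r++)
            for (int c = 0; c < OUT_DIM; c++) {
                float acc = bias;
                for (int i = 0; i < K; i++)
                    for (int j = 0; j < K; j++)
                        acc += w[i][j] * in[r + i][c + j];
                out[r][c] = acc;
            }
    }

    /* 2x2 average pooling, stride 2: halves the feature map dimensions. */
    void avgpool2x2(const float in[OUT_DIM][OUT_DIM],
                    float out[POOL_DIM][POOL_DIM])
    {
        for (int r = 0; r < POOL_DIM; r++)
            for (int c = 0; c < POOL_DIM; c++)
                out[r][c] = (in[2*r][2*c]     + in[2*r][2*c + 1] +
                             in[2*r + 1][2*c] + in[2*r + 1][2*c + 1]) / 4.0f;
    }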
III. LENET ARCHITECTURE

Lenet is a pre-trained convolutional neural network model used for recognizing handwritten and machine-printed characters. The network has 5 layers with learnable parameters and is thus named Lenet-5. It has three sets of convolution layers with a combination of average pooling. After the convolution and average pooling layers, we have two fully connected layers. Finally, a Softmax classifier is used to classify the images into their respective classes. The Lenet layers are depicted in Fig 2.

Fig 2: Architecture of Lenet-5 Model

The following table is used to understand the architecture in more detail.

Layer             | #Filters/Neurons | Filter Size | Stride | Feature Map Size | Activation Function
Input             | -                | -           | -      | 32 x 32 x 1      | -
Conv 1            | 6                | 5 x 5       | 1      | 28 x 28 x 6      | tanh
Avg Pooling 1     | -                | 2 x 2       | 2      | 14 x 14 x 6      | -
Conv 2            | 16               | 5 x 5       | 1      | 10 x 10 x 16     | tanh
Avg Pooling 2     | -                | 2 x 2       | 2      | 5 x 5 x 16       | -
Conv 3            | 120              | 5 x 5       | 1      | 120              | tanh
Fully Connected 1 | -                | -           | -      | 84               | tanh
Fully Connected 2 | -                | -           | -      | 10               | Softmax

The first layer is the input layer with a feature map size of 32 x 32 x 1. Next comes the first convolution layer with 6 filters of size 5 x 5 and a stride of 1. The activation function used at this layer is the hyperbolic tangent (tanh). The output feature map is 28 x 28 x 6. Then there is an average pooling layer with a filter size of 2 x 2 and a stride of 2. The resulting feature map is 14 x 14 x 6, since the pooling layer does not affect the number of channels.

After this comes the second convolution layer with 16 filters of 5 x 5 and a stride of 1. Again, the activation function is tanh, and the output size is 10 x 10 x 16. Then comes the other average pooling layer of 2 x 2 with a stride of 2. As a result, the size of the feature map is reduced to 5 x 5 x 16.

The final convolution layer has 120 filters of 5 x 5 with a stride of 1 and activation function tanh, so the output size is 120. Next is a fully connected layer with 84 neurons that reduces the output to 84 values; the activation function used here is again tanh. The last layer is the output layer with 10 neurons and the Softmax function. The Softmax gives the likelihood that an input belongs to a particular class; the class with the highest value is then predicted.

This is the entire architecture of the Lenet-5 model. The number of trainable parameters of this architecture is around 60,000.
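Taking the layer shapes in the table at face value, the roughly 60,000 trainable parameters can be checked layer by layer using (filter height x filter width x input channels + 1 bias) x number of filters for the convolutions and (inputs + 1) x outputs for the fully connected layers: Conv 1 has (5 x 5 x 1 + 1) x 6 = 156, Conv 2 has (5 x 5 x 6 + 1) x 16 = 2,416, Conv 3 has (5 x 5 x 16 + 1) x 120 = 48,120, the first fully connected layer has (120 + 1) x 84 = 10,164, and the output layer has (84 + 1) x 10 = 850, for a total of 61,706 parameters; the pooling layers contribute none.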

IV. RESULTS

The table below gives the hardware cost occupied by our design on the Zynq-7000 board. It shows that the design occupies 47% of the LUTs, 19% of the LUTRAMs, 28% of the FFs, 59% of the BRAMs, 54% of the DSPs, and 3% of the BUFGs. The implemented design can also be displayed to give an idea of how the design has been distributed, placed, and routed on the selected Zynq-7000 board.

Resource | Utilization | Available | Utilization %
LUT      | 25456       | 53200     | 47.85
LUTRAM   | 3478        | 17400     | 19.99
FF       | 30456       | 106400    | 28.62
BRAM     | 83.5        | 140       | 59.64
DSP      | 120         | 220       | 54.55
BUFG     | 1           | 32        | 3.13
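The utilization percentages follow directly from the used and available counts in the table; for example, 25,456 / 53,200 ≈ 47.85% for the LUTs and 83.5 / 140 ≈ 59.64% for the BRAMs.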

A. Software Model (MATLAB)

The software model runs through the whole Lenet CNN and gives us its prediction. Fig 3 depicts the result of the software model in MATLAB.

Fig 3: MATLAB Result

B. Hardware Model

The hardware model gives us 10 outputs with the scores for all 10 digits, namely digits 0 through 9. The highest score should indicate the correct answer, and in all test images the highest score does correspond to the correct value. Fig 4 shows the simulation result from Xilinx Vivado. In the example below, where the digit 0 is applied, the highest score (5) occurs for digit 0, showing that 0 is the predicted number.

Fig 4: Behavioral simulation on Xilinx Vivado
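Selecting the predicted digit from the ten output scores is a simple argmax step; the sketch below is an illustrative C version of that final comparison (the score type and names are assumptions, not identifiers from the actual design).

    /* Return the index (0-9) of the largest of the ten digit scores. */
    int predicted_digit(const int scores[10])
    {
        int best = 0;
        for (int d = 1; d < 10; d++)
            if (scores[d] > scores[best])
                best = d;
        return best;
    }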
C. Implementation

Once implementation has completed, an implementation summary that collects all of the implementation reports is produced. Fig 5 depicts the implemented design.

Fig 5: Implemented Design of Lenet-5

V. CONCLUSION

In this paper, we proposed a Lenet-5 convolutional neural network with optimizations based on hardware looping and dot product units, which provided high accuracy when recognizing handwritten data from the MNIST dataset. The hardware loops alone contribute a 24% decrease in the cycle count, while the dot products reduce the cycle count by 27%. As embedded systems are highly price-sensitive, this is an important consideration. Getting the sizes of neural networks down is an essential step in expanding the possibilities for neural networks in embedded systems.
An interesting topic for further research is Posit, an alternative floating-point number format that may offer additional advantages, as it has an increased dynamic range at the same word size. Because of the improved dynamic range, weights could be stored in lower precision, thus again decreasing the memory requirements. Combining the reduced size requirements with low-cost ISA improvements could make neural networks more ubiquitous in the price-sensitive embedded systems market.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS

Bhargav Bhat conducted the research, analyzed the data, and wrote the paper; Dr. Abhay A Deshpande acted as supervisor and advisor of the research and edited the paper.

REFERENCES

[1] J. Vreča et al., "Accelerating Deep Learning Inference in Constrained Embedded Devices Using Hardware Loops and a Dot Product Unit," in IEEE Access, vol. 8, pp. 165913-165926, 2020, doi: 10.1109/ACCESS.2020.3022824.
[2] D. Valencia, S. F. Fard and A. Alimohammad, "An Artificial Neural Network Processor With a Custom Instruction Set Architecture for Embedded Applications," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 12, pp. 5200-5210, Dec. 2020, doi: 10.1109/TCSI.2020.3003769.
[3] X. Chen and Z. Yu, "A Flexible and Energy-Efficient Convolutional Neural Network Acceleration With Dedicated ISA and Accelerator," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 7, pp. 1408-1412, July 2018, doi: 10.1109/TVLSI.2018.2810831.
[4] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700-2713, Oct. 2017, doi: 10.1109/TVLSI.2017.2654506.
[5] Y. Tian, "Artificial Intelligence Image Recognition Method Based on Convolutional Neural Network Algorithm," in IEEE Access, vol. 8, pp. 125731-125744, 2020, doi: 10.1109/ACCESS.2020.3006097.
[6] Xin, J. Zhang and Y. Shao, "Complex network classification with convolutional neural network," in Tsinghua Science and Technology, vol. 25, no. 4, pp. 447-457, Aug. 2020, doi: 10.26599/TST.2019.9010055.
[7] Shomron and U. Weiser, "Spatial Correlation and Value Prediction in Convolutional Neural Networks," in IEEE Computer Architecture Letters, vol. 18, no. 1, pp. 10-13, 1 Jan.-June 2019, doi: 10.1109/LCA.2018.2890236.
[8] Y. Qie, P. Song and C. Hao, "Data Repair Without Prior Knowledge Using Deep Convolutional Neural Networks," in IEEE Access, vol. 8, pp. 105351-105361, 2020, doi: 10.1109/ACCESS.2020.2999960.
[9] A. J. Salim, S. I. M. Salim, N. R. Samsudin and Y. Soo, "Customized instruction set simulation for soft-core RISC processor," 2012 IEEE Control and System Graduate Research Colloquium, Shah Alam, Selangor, 2012, pp. 38-42, doi: 10.1109/ICSGRC.2012.6287132.
[10] A. Raveendran, V. B. Patil, D. Selvakumar and V. Desalphine, "A RISC-V instruction set processor-micro-architecture design and analysis," 2016 International Conference on VLSI Systems, Architectures, Technology and Applications (VLSI-SATA), Bangalore, 2016, pp. 1-7, doi: 10.1109/VLSI-SATA.2016.7593047.

Bhargav Bhat is a postgraduate student of VLSI Design and Embedded Systems. He is currently pursuing the final year of his Master of Technology degree at RV College of Engineering, Bengaluru, India. His areas of interest are VLSI, RTL design, and automotive embedded systems.

Dr. Abhay Deshpande is currently working as an Associate Professor in the Department of Electronics and Communication Engineering at RV College of Engineering, Bengaluru. He has 8 years of teaching experience and 1 year of industry experience. He received his Ph.D. degree in control systems from VTU Belagavi and his M.Tech from VTU Belagavi. His research interests include control systems, robotics, and DSP.