DLAU: A Scalable Deep Learning Accelerator Unit on FPGA
N. Vimala, B. Alekya Himabindu, Y. Mallikarjuna Rao, G. Sowmya, S. Girish Babu, M. Mahesh Kumar
INTRODUCTION
In recent years, machine learning has seen widespread implementation in several software and cloud
services. It has been known since 2006 that a certain class of neural network algorithms outperforms previous state-of-the-art techniques in a variety of machine learning tasks. In particular, the artificial neural networks known as convolutional neural networks (CNNs) have been shown to be highly effective in addressing difficult machine learning problems such as image and audio recognition. Deep learning (DL) refers to such multilayer neural networks, which are notoriously resource-hungry because of their many hidden layers. Google's cat-recognition system (1 billion neural connections) and Baidu's Brain system (100 billion neural connections) are just two examples of how the size of deep learning neural networks is growing in tandem with the accuracy requirements and intricate nature of practical applications. As a result,
optimizing the performance of large-scale neural networks trained using deep learning has emerged
as a central area of study.
The field-programmable gate array (FPGA) is a key means of speeding up deep learning algorithms because of its high performance combined with low power consumption. Owing to the exponential growth of data, data centers consume a great deal of energy; for example, it was projected that by 2020 the yearly power usage of data centers in the United States would increase to almost 140 billion kilowatt-hours [4]. Therefore, building high-performance deep learning systems with low power consumption presents major challenges, particularly for large-scale deep neural network models. Field-programmable gate arrays, application-specific integrated circuits, and
graphics processing units are now state-of-the-art methods for speeding up deep learning procedures.
Hardware accelerators such as FPGAs and ASICs can deliver moderate performance with far lower power consumption than GPU acceleration. However, building large and complicated DNNs on hardware accelerators is difficult because of the hardware's limited processing power, memory, and input/output (I/O) bandwidth. The ASIC design process is also time-consuming, and the resulting lack of flexibility leaves designers wanting. DianNao, introduced by Chen et al. [6] as a hardware accelerator for ubiquitous machine learning, is widely credited with launching the deep learning processor field; it shifted the paradigm toward neural-network-focused hardware accelerators for machine learning. However, DianNao cannot adjust to the needs of various applications, since it is not built on reconfigurable hardware such as an FPGA. Current work on FPGA acceleration includes an accelerator for the restricted Boltzmann machine (RBM) built by Ly and Chow [5], in which the RBM method was optimized by designing special hardware processing cores. The RBM was also the subject of an FPGA-based accelerator built by Kim et al. [7]; each RBM processing module handles a small subset of the network's nodes, allowing for parallel processing. Similar FPGA-based accelerators for machine learning are presented in other research [9]. The FPGA-based accelerator reported by Yu et al. [8] is limited in its ability to adapt to dynamic changes in network size and topology. In summary, these research efforts center on efficiently executing a single deep learning algorithm, but the question of how to scale up the size of neural networks while maintaining flexibility remains unanswered.
To address these issues, we introduce a deep learning accelerator unit (DLAU) that can be scaled up to accelerate the computationally intensive kernels of deep learning algorithms. To support large-scale neural networks, we make use of tiling techniques, FIFO buffers, and pipelines to reduce memory transfer operations and maximize the reuse of the computing units. The following are some of the ways in which this method differs from the aforementioned literature.
1) We use tiling approaches to partition the large-scale input data in order to exploit the locality of the deep learning application. To trade off performance against cost, the DLAU framework may be configured to run with a variety of tile sizes. As a result, the FPGA-based accelerator may be easily adapted to a wide variety of machine learning applications.
2) The DLAU accelerator consists of three fully pipelined processing units: the tiled matrix multiplication unit (TMMU), the partial sum accumulation unit (PSAU), and the activation function acceleration unit (AFAU). These building blocks may be used to construct a wide variety of more complex networks, including conventional and deep neural networks. Because of this, FPGA-based accelerators are more scalable than ASIC-based ones.
Layering is a common method for structuring neural networks. 'Nodes' are the building blocks of a layer; each one contains an 'activation function' and is connected to others via 'links.' The 'input layer' receives patterns from the user and relays them to the 'hidden layers' via a set of weighted 'connections,' where the real processing occurs. The hidden layers in turn connect to the 'output layer,' which produces the inferred solution, as shown in the diagram below.
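To make the layer computation above concrete, the following C sketch (not code from this paper; the sigmoid activation, the layer sizes, and all identifiers are illustrative assumptions) produces one layer's outputs from its inputs and weighted connections:

#include <math.h>
#include <stddef.h>

/* Activation function applied at each node (sigmoid assumed for illustration). */
static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* out[j] = sigmoid( sum_i weight[j][i] * in[i] ) for a layer with n_in inputs
 * and n_out output nodes; the weight matrix is stored row-major. */
void layer_forward(const double *in, size_t n_in,
                   const double *weight, size_t n_out, double *out)
{
    for (size_t j = 0; j < n_out; ++j) {
        double acc = 0.0;
        for (size_t i = 0; i < n_in; ++i)
            acc += weight[j * n_in + i] * in[i];   /* weighted connection */
        out[j] = sigmoid(acc);                     /* node "fires" through its activation */
    }
}

Stacking such layers, with each layer's outputs feeding the next layer's inputs, yields the deep networks discussed in the rest of this paper.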
LITERATURE SURVEY
DjiNN and Tonic: DNN as a Service and Its Implications for Future Warehouse-Scale Computers.
Abstract: As voice-activated assistants like Apple's Siri, Google Now, Microsoft's Cortana, and Amazon's Echo gain popularity, online service providers are turning to large neural networks, known as deep neural networks (DNNs), to tackle difficult machine learning tasks like image processing, speech recognition, and natural language processing. There are many open questions regarding the optimal configuration of today's warehouse-scale computers (WSCs) and the construction of a server architecture tailored specifically for DNN applications. In this work we introduce DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a set of seven end-to-end applications for processing images, speech, and language. We use DjiNN to design a high-throughput DNN system based on massive GPU server architectures, and we shed light on the wide range of application-specific characteristics. After analyzing the performance, bandwidth, and power characteristics of DjiNN and Tonic Suite, we examine several design points for future WSC architectures. We compare a WSC with an aggregated GPU pool to one made up of homogeneous integrated GPU servers, and we calculate the resulting difference in total cost of ownership. We improve DNN throughput by over 120x on an NVIDIA K40 GPU for all but one application (facial recognition, by 40x). For three of the seven applications, we see near-linear scaling on a GPU server built from eight NVIDIA K40s (a throughput gain of roughly 1000x). We also find that, depending on the workload's makeup, the total cost of ownership for GPU-enabled WSCs is 4x-20x lower than for CPU-only designs.
Authors: Johann Hauswald, Yiping Kang, Michael A. Laurenzano, Quan Chen, Cheng Li, Trevor Mudge, Ronald G. Dreslinski, Jason Mars, and Lingjia Tang, Clarity Lab, University of Michigan, Ann Arbor, MI, USA.
Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks.
Abstract: Since convolutional neural networks (CNNs) can achieve high accuracy by modeling the behavior of biological optic nerves, they have found widespread use in image recognition. The rapid development of new applications based on deep learning algorithms has further advanced both research and implementation in this area. With its high performance, reconfigurability, and fast development cycle, FPGA technology has inspired a number of proposed accelerators for deep CNNs. Although existing FPGA accelerators have demonstrated better performance than general-purpose processors, the accelerator design space has not been well exploited. One major issue is that the computation throughput may not match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches are not optimal because of inefficient use of logic resources and memory bandwidth. Meanwhile, this problem is made worse by the growing complexity and scalability of deep learning applications. To address this issue, we propose an analytical design method based on the roofline model. Using several optimization techniques, such as loop tiling and transformation, we quantitatively analyze the computation throughput and required memory bandwidth of each variant of a CNN design. The roofline model is then used to identify the solution with the best performance and the lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare our method with others. Our solution is substantially more efficient than prior approaches, achieving a peak performance of 61.62 GFLOPS at a working frequency of 100 MHz.
Authors: Chen Zhang, Peng Li, Jason Cong, et al.
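As a rough illustration of the roofline argument summarized in the abstract above (a sketch under assumed numbers and names, not figures from the cited paper), attainable throughput is bounded by the smaller of the computational roof and the product of memory bandwidth and the computation-to-communication (CTC) ratio:

#include <stdio.h>

/* Attainable throughput = min(compute roof, bandwidth * CTC ratio). */
static double attainable_gflops(double compute_roof_gflops,
                                double bandwidth_gb_s,
                                double ctc_flops_per_byte)
{
    double bandwidth_roof = bandwidth_gb_s * ctc_flops_per_byte;
    return bandwidth_roof < compute_roof_gflops ? bandwidth_roof
                                                : compute_roof_gflops;
}

int main(void)
{
    /* Example numbers only: a design point with a CTC ratio of 8 FLOPs/byte
     * on a platform with a 100 GFLOPS compute roof and 4.5 GB/s bandwidth. */
    printf("attainable: %.1f GFLOPS\n", attainable_gflops(100.0, 4.5, 8.0));
    return 0;
}

A design whose bandwidth roof falls below the compute roof is memory-bound, which is exactly the mismatch the cited work reduces by tuning loop tiling.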
A High-Performance FPGA Architecture for Restricted Boltzmann Machines.
Abstract: Custom hardware may be utilized to improve the scalability and performance of neural network implementations by exploiting the networks' intrinsic parallelism. One common neural network architecture that lends itself well to hardware implementation is the restricted Boltzmann machine (RBM), which we explore in detail. With the proposed general-purpose hardware architecture, the O(n²) resource requirement can be reduced to O(n), making the implementation feasible. A 100 MHz Xilinx Virtex II-Pro XC2VP70 FPGA is used to test the framework. With these means, a restricted Boltzmann machine with 128 by 128 nodes operates at a pace of 1.02 billion connection-updates per second, which is a 35-fold increase in speed over an optimized C code running on a 2.8 GHz Intel CPU.
Authors: D. L. Ly and P. Chow.
"Stacked neural networks," or multi-layered networks, are what we mean when we talk about "deep
learning." Nodes are what make up the layers. A node is only a location for computing, and it is
based on the analogy of a neuron in the human brain, which "fires" when it receives enough stimulation. To learn a task, a node combines input data with a set of coefficients, or weights, that either boost or reduce the relevance of that input (for instance, which input is most helpful for classifying data without error). The activation function of a node decides whether or not the
signal from its input-weight product should be passed on to subsequent layers of the network to
influence the final outcome, such as a classification.
The number of node layers through which input flows in a multistep process of pattern recognition
is what sets deep-learning networks apart from the more conventional single-hidden-layer neural
networks.
Fig. 2 illustrates a deep neural network for handwritten digit recognition trained on MNIST, a dataset of handwritten digits. Its structure is as follows: one input layer, several hidden layers, and one output layer. In this study, DNNs serve as an illustration. DNNs have two computational modes: the prediction process and the training process. The weight coefficients obtained during
the training phase are used to calculate the output for each input during the prediction step, which is
a feed forward calculation. There are two phases to the training process: pre-training, in which the
connection weights connecting the units in neighboring layers are tuned locally using the training
datasets [4]; and global training, in which the connection weights are tuned globally using the back propagation (BP) algorithm. In this paper we focus on the prediction process rather than the training process, owing to technological and commercial considerations.
This section provides a high-level overview of RBM mechanics, including some of the basic
vocabulary, mathematical foundation, and techniques involved. An RBM is a type of stochastic, generative neural network. Given a set of patterns, the network is able to construct an internal model that can recognize new data drawn from the same distribution, making it useful for modeling the statistical behavior of a given collection of data.
Additionally, the network may produce data from that distribution thanks to the underlying model's
generative characteristic. Because it employs a probabilistic strategy, the RBM may be thought of as
stochastic. The RBM uses statistical procedures to ascertain the probability distribution of statistical
attributes of a dataset. RBMs are distinct from conventional neural networks in both their design and
their applications due to their generative and stochastic qualities.
The RBM may be trained like any other neural network. Rather than being set manually, its parameters are adjusted iteratively according to a particular data set using the learning rules. To this end, the RBM is applied to a collection of data vectors that serve as the training set, and it processes the training data iteratively until it reaches an adequate level of training. The test data is a fresh collection of data vectors that have not been used in training and is used to check the model's behavior.
The RBM is a synthesis of neural network theory and statistical mechanics, so its terminology reflects the origins of these two disciplines. Processing elements in neural networks are commonly referred to as nodes, and each node can be in one of two possible states, on or off. There are two sets of nodes in an RBM: the visible layer and the hidden layer. The hidden layer holds the network's internal representation of the data, while the visible layer is used for input/output. Nodes in the same layer are not connected to one another, but every node in one layer is connected to every node in the other layer. The weights assigned to these connections are what the RBM adjusts as it trains.
The ith and jth nodes' binary states in the visible and hidden layers, respectively, are denoted by vi
and hj, whereas wi,j represents the connection weight between the ith and jth nodes. In Fig. 3 we
have a simplified diagram that encapsulates all of the relevant vocabulary and nomenclature.
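Using the vi, hj, and wi,j notation above, the following C sketch (an illustration under simplified assumptions, not code from the cited works; bias handling and sampling are simplified) shows how the probability that a hidden node turns on is computed from the visible layer:

#include <math.h>
#include <stddef.h>

/* p(h_j = 1 | v) = sigmoid( b_j + sum_i v_i * w_ij ), where v holds the binary
 * states of the visible nodes and w_col_j holds the weights into hidden node j. */
double hidden_on_probability(const int *v, size_t n_visible,
                             const double *w_col_j, double b_j)
{
    double act = b_j;
    for (size_t i = 0; i < n_visible; ++i)
        act += v[i] * w_col_j[i];          /* weighted input from the visible layer */
    return 1.0 / (1.0 + exp(-act));        /* the stochastic node fires with this probability */
}

Sampling each hidden node from this probability, and the visible nodes from the symmetric expression, is the alternating step that RBM training repeats over the training set.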
PROPOSED SYSTEM
To address these issues, we introduce DLAU, a scalable deep learning accelerator unit designed to
expedite the algorithm's kernel computations. In particular, we design large-scale neural networks
by making use of tiling methods, FIFO buffers, and pipelines to reduce the number of memory
transfer operations and maximize the reusability of the computational nodes. The following
contributions set this method apart from its predecessors.
1) We use tiling methods to partition the massive input data so that we can exploit the locality of the deep learning application. To take advantage of the tradeoff between speedup and hardware cost, the DLAU architecture may be configured to handle different tile sizes. Therefore, the FPGA-based accelerator is more flexible in its ability to adapt to various machine learning tasks.
2) The DLAU accelerator contains a tiled matrix multiplication unit (TMMU), a partial sum accumulation unit (PSAU), and an activation function acceleration unit (AFAU), all of which are fully pipelined.
Iterative calculations with minimal conditional branch operations are common in large-scale
DNNs, making them a good candidate for hardware parallelization. In this study, we begin by
utilizing the profiler to investigate the hotspot. Matrix multiplication (MM), activation, and vector
operations account for a large fraction of the total execution time, as seen in Fig. 1. MM dominates the three most representative operations: feed forward, RBM, and BP. In particular, it accounts for 99.1% of BP operations, 100% of RBM operations, and 98.6% of feed forward operations, while the activation function accounts for only 1.40%, 1.48%, and 0.42% of these operations, respectively. Profiling experiments show that MM accelerators may greatly boost
system speedup when properly designed and implemented.
However, in comparison to GPU and CPU optimization approaches, FPGA implementations
face a substantial difficulty due to the large memory bandwidth and compute resources required to
enable the parallel processing. In this study, we apply tiling techniques to split the large input data set into tiled subsets and thereby attack the problem. Each tiled portion of data can be stored in the hardware accelerator's local buffer before it is processed, so the accelerator architecture is able to support large-scale neural networks. Moreover, the computation of the hardware accelerators can be overlapped with the data access for each tiled subset.
In particular, the output neurons in one iteration become the input neurons in the following
iteration. At each iteration, the weight matrix is multiplied by each column of input neurons to
produce a new set of output neurons. Algorithm 1 shows how the input values are tiled and multiplied
by the matching
weights. The partial sums are then accumulated to obtain the final result. We additionally tile the weight matrix to match the tiled input/output neurons. This results in a huge reduction in the hardware requirements of the accelerator, since its cost is now proportional only to the tile size. The tiling approach is effective because it permits the deployment of extensive networks while making use of relatively few physical resources. Compared with GPU designs, which rely on massively parallel SIMD architectures to boost overall speed and throughput, FPGA technology has the additional benefit of a pipelined hardware implementation. In this paper, we implement the specialized accelerator to speed up the MM and activation functions, which, as shown by the profiling results in Table I, are common but important computational parts of both the prediction process and the training process in deep learning algorithms.
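The following C sketch illustrates the tiling idea described above (it is an illustration of the technique, not Algorithm 1 verbatim; TILE_SIZE, the float data type, and all identifiers are assumptions):

#include <stddef.h>

#define TILE_SIZE 32

/* Accumulates weight * input over one tile of columns at a time, so that only
 * TILE_SIZE input neurons and the matching weight columns need to be buffered;
 * out[] must be zero-initialized by the caller before the first tile. */
void tiled_mv(const float *weight, size_t rows, size_t cols,
              const float *in, float *out)
{
    for (size_t t = 0; t < cols; t += TILE_SIZE) {
        size_t tile_end = (t + TILE_SIZE < cols) ? t + TILE_SIZE : cols;
        for (size_t r = 0; r < rows; ++r) {
            float part = 0.0f;
            for (size_t c = t; c < tile_end; ++c)
                part += weight[r * cols + c] * in[c];   /* product within the tile */
            out[r] += part;                             /* accumulate the partial sum */
        }
    }
}

In the hardware version, the per-tile products are what the TMMU computes, the running accumulation of out[r] is the PSAU's job, and the activation function applied afterwards is handled by the AFAU.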
DLAU ARCHITECTURE AND EXECUTION MODEL
The DLAU accelerator, the DDR3 memory controller, the direct memory access (DMA) module, and the embedded processor are all shown in Fig. 6. The embedded processor communicates with the DLAU through JTAG-UART and provides a programming interface for the users. When the DLAU accelerator is activated, the input data and the weight matrix are stored in internal BRAM blocks, and the results are delivered to the user after execution. The DLAU is designed as a configurable, standalone component that may be adapted to a wide variety of applications. The DLAU is made up of three processing units arranged in a sequential pipeline:
1) TMMU;
2) PSAU; and
3) AFAU.
When executed, DLAU retrieves tiled data from memory through DMA, processes it using
each of the three processing units in turn, and then stores the processed data back in memory. The
following are distinguishing characteristics of the DLAU accelerator design.
TMMU Architecture
The TMMU is responsible for multiplication and accumulation. It computes the partial sums and is tailored to take advantage of the data locality of the weights. An input FIFO buffer receives data via DMA, and the partial sums are sent to the PSAU through an output FIFO buffer. In Fig. 7, we show the TMMU diagram in more detail for a tile size of 32. The TMMU reads the weight matrix from the input buffer and distributes it across the BRAMs as n = i mod 32, where n is the BRAM index and i is the row number of the weight matrix. The TMMU then starts buffering the tiled node data. When the TMMU first begins execution, it loads the 32 tiled values into the registers Reg_a. In every cycle, the TMMU receives the next node value from the input buffer and writes it to the registers Reg_b in parallel with the computation. As a result, Reg_a and Reg_b are used as alternating registers.
We employ a pipelined binary adder tree structure to execute the computation quickly and efficiently. As Fig. 7 shows, the node values and their associated weights are stored in registers and BRAMs. The coarse-grained accelerator is time-shared in a pipelined fashion. As a result, the TMMU is able to produce a partial sum result at the end of every clock cycle.
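The following C sketch mirrors, in software, two TMMU details described above, under the stated assumptions (tile size 32; BRAMs and registers modelled as plain arrays; all identifiers illustrative): the row-to-BRAM mapping and the binary adder tree that reduces the 32 products into one partial sum per cycle.

#include <stddef.h>

#define TILE 32

/* BRAM bank that holds row i of the weight matrix (n = i mod 32). */
size_t bram_index(size_t i) { return i % TILE; }

/* Reduce the 32 node*weight products with a binary adder tree; in hardware
 * each level of the tree is one pipeline stage. */
static float adder_tree_sum(float prod[TILE])
{
    for (size_t stride = TILE / 2; stride >= 1; stride /= 2)
        for (size_t k = 0; k < stride; ++k)
            prod[k] += prod[k + stride];
    return prod[0];
}

/* Partial sum of one weight row against the 32 tiled node values held in the
 * Reg_a/Reg_b register pair (here just an array). */
float tmmu_partial_sum(const float weights[TILE], const float reg_nodes[TILE])
{
    float prod[TILE];
    for (size_t k = 0; k < TILE; ++k)
        prod[k] = weights[k] * reg_nodes[k];   /* 32 multipliers working in parallel */
    return adder_tree_sum(prod);
}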
To keep the throughput consistent with the other processing units, the AFAU, like the PSAU, employs input and output buffers. More specifically, the values of the coefficients a and b are kept in two separate BRAMs, and the sigmoid function used in the AFAU is computed in a pipelined fashion so that a result is produced in every clock cycle. Therefore, the maximum throughput of the DLAU accelerator architecture is guaranteed by fully pipelined operation across all three processing units.
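As a hedged sketch of the AFAU idea mentioned above (segment count, input range, and building the lookup tables at run time are illustrative assumptions, not details taken from the design), the sigmoid can be approximated by piecewise linear segments y = a*x + b, with the a and b coefficients kept in two tables standing in for the two BRAMs:

#include <math.h>
#include <stddef.h>

#define SEGMENTS 64
#define X_MIN   (-8.0f)
#define X_MAX    (8.0f)

static float a_tab[SEGMENTS], b_tab[SEGMENTS];

/* Fill the a/b tables by sampling the true sigmoid at segment endpoints. */
void afau_init(void)
{
    float step = (X_MAX - X_MIN) / SEGMENTS;
    for (size_t k = 0; k < SEGMENTS; ++k) {
        float x0 = X_MIN + k * step, x1 = x0 + step;
        float y0 = 1.0f / (1.0f + expf(-x0));
        float y1 = 1.0f / (1.0f + expf(-x1));
        a_tab[k] = (y1 - y0) / step;          /* slope of the segment */
        b_tab[k] = y0 - a_tab[k] * x0;        /* intercept of the segment */
    }
}

/* One table lookup and one multiply-add per value, as in the pipelined unit. */
float afau_sigmoid(float x)
{
    if (x <= X_MIN) return 0.0f;
    if (x >= X_MAX) return 1.0f;
    size_t k = (size_t)((x - X_MIN) / (X_MAX - X_MIN) * SEGMENTS);
    if (k >= SEGMENTS) k = SEGMENTS - 1;
    return a_tab[k] * x + b_tab[k];
}

One table lookup and one multiply-add per value keeps the unit simple enough to produce a result every clock cycle, which is what allows the three units to stay fully pipelined.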
RESULTS
CONCLUSION
In this study, we introduce DLAU, an FPGA-based deep learning accelerator designed for scalability and flexibility. The DLAU employs three pipelined processing units to support large-scale neural networks. To reuse the arithmetic logic through time-sharing, the DLAU utilizes tiling techniques to partition the input node data into smaller sets. In experiments conducted on a prototype Xilinx FPGA, DLAU was shown to yield a speedup of up to 36.1x with a modest increase in hardware cost and minimal increase in power consumption.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444,
2015.
[2] J. Hauswald et al., “DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers,” in Proc. ISCA, Portland, OR, USA, 2015, pp. 27–40.
[3] C. Zhang et al., “Optimizing FPGA-based accelerator design for deep convolutional neural
networks,” in Proc. FPGA, Monterey, CA, USA, 2015, pp. 161–170.
[4] P. Thibodeau. Data Centers are the New Polluters. Accessed on Apr. 4, 2016. [Online]. Available: http://www.computerworld.com/article/2598562/data-center/data-centers-are-the-new-polluters.html
[5] D. L. Ly and P. Chow, “A high-performance FPGA architecture for restricted Boltzmann
machines,” in Proc. FPGA, Monterey, CA, USA, 2009, pp. 73–82.
[6] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proc. ASPLOS, Salt Lake City, UT, USA, 2014, pp. 269–284.
[7] S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun, “A highly scalable restricted
Boltzmann machine FPGA implementation,” in Proc. FPL, Prague, Czech Republic, 2009, pp. 367–
372.
[8] Q. Yu, C. Wang, X. Ma, X. Li, and X. Zhou, “A deep learning prediction process accelerator
based FPGA,” in Proc. CCGRID, Shenzhen, China, 2015, pp. 1159–1162.
[9] J. Qiu et al., “Going deeper with embedded FPGA platform for convolutional neural network,”
in Proc. FPGA, Monterey, CA, USA, 2016, pp. 26–35.
[10] S. Gali, B. Alekya Himabindu, V. Nagamani, S. Jayamangala, S. Munawwar, and Y. Mallikarjuna Rao, “Capacity trust assessment for multi-hop routing in wireless sensor networks,” E3S Web of Conferences, vol. 391, 01181, 2023.
[11] B. Saraswati and M. V. Subramanyam, “Multiple sensor data fusion based top-down embedded system for automatic plant pot watering,” Journal of Algebraic Statistics, vol. 13, no. 1, pp. 884–891, 2022.
[12] “Despite non-uniform motion blur, illumination, and noise, face recognition,” International Journal of Progressive Research in Engineering Management and Science (IJPREMS), vol. 02, issue 08, pp. 25–29, Aug. 2022, e-ISSN 2583-1062.
[13] “VLSI implementation of ternary adder and multiplier using Tanner tool,” Journal of Pharmaceutical Negative Results, vol. 13, special issue 5, 2022.
[14] B. A. Himabindu, D. S. Kumar, G. Mahendra, D. M. Srikanth, and K. D. Reddy, “Novel design of ternary arithmetic circuits,” IJIRE, vol. 3, issue 4, pp. 172–183, Jul.–Aug. 2022.
[15] O. M. Chandrika and B. A. H. Bindu, “Carry select adder design using D-latch with less delay and more power efficient,” IJMTES, vol. 3, issue 7, 2016.
[16] O. M. Chandrika and B. A. H. Bindu, “Automatic gas alerting system,” Imperial Journal of Interdisciplinary Research (IJIR), vol. 2, no. 6, ISSN 2454-1362, 2016.
[17] A. H. Bindu and V. Saraswati, “Watermarking of digital images with iris based biometric data using wavelet and SVD,” Int. J. Eng. Dev. Res., vol. 4, no. 1, pp. 726–731, 2016.
[18] S. B. Sridevi and B. A. Himabindu, “Design of high performance pipelined Data Encryption Standard (DES) using Xilinx Virtex-6 FPGA technology,” The International Journal of Science and Technoledge, vol. 2, no. 2, p. 53, 2014.
[19] Ch. Nagaraju, A. K. Sharma, and M. V. Subramanyam, “A review on BER performance analysis and PAPR mitigation in MIMO OFDM systems,” International Journal of Engineering Technology and Computer Research, vol. 3, no. 3, pp. 237–238, Jun. 2015.
[20] V. Jyothi and M. V. Subramanyam, “QIF-NDMRP: QoS-aware interference-free node-disjoint multipath routing protocol using source-destination edge based path selection scheme,” Webology, vol. 19, no. 2, pp. 773–786, ISSN 1735-188X, 2022.
[21] Y. Mallikarjuna Rao, M. V. Subramanyam, and K. Satyaprasad, “An efficient resource management algorithms for mobility management in wireless mesh networks,” IEEE Conference on Energy, Communication, Data Analytics and Soft Computing, SKR Engineering College, Chennai, 2017.
[22] Y. Mallikarjuna Rao, M. V. Subramanyam, and K. Satyaprasad, “Performance analysis of energy aware QoS enabled routing protocol for wireless mesh networks,” International Journal of Smart Grid and Green Communications.
[23] Y. Mallikarjuna Rao et al., “Design and Performance Analysis of Buffer Inserted On-Chip Global