0% found this document useful (0 votes)

23 views15 pages

Dlau: A Scalable Deep Learning Accelerator Unit On Fpga N.Vimala, B.Alekya Himabindu, Y.Mallikarjuna Rao, G.Sowmya, S.Girish Babu, M.Mahesh Kumar

The document presents the Deep Learning Accelerator Unit (DLAU), a scalable architecture designed for large-scale neural networks using FPGA technology, aimed at enhancing performance while reducing power consumption. It highlights the challenges of optimizing deep learning systems and introduces innovative methods such as tiling approaches and pipelined processing units to improve computational efficiency. Experimental results demonstrate that the DLAU can achieve performance levels comparable to traditional CPUs, making it a promising solution for deep learning applications.

Uploaded by

sandypsetty53

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views15 pages

Dlau: A Scalable Deep Learning Accelerator Unit On Fpga N.Vimala, B.Alekya Himabindu, Y.Mallikarjuna Rao, G.Sowmya, S.Girish Babu, M.Mahesh Kumar

Uploaded by

sandypsetty53

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

ISSN: 2096-3246

Volume 55, Issue 02, September, 2023

DLAU: A SCALABLE DEEP LEARNING ACCELERATOR UNIT ON FPGA

1N.Vimala, 2B.Alekya Himabindu, 3Y.Mallikarjuna Rao, 4G.Sowmya, 5S.Girish Babu,

6M.Mahesh Kumar
1
M.Tech Student, Santhiram Engineering College,Nandyal
2,5,6
Assistant Professor, Santhiram Engineering College,Nandyal
3,4
Associate Professor, Santhiram Engineering College,Nandyal
ABSTRACT
As a new subfield of neural networks, deep learning has proven to be highly effective at tackling
difficult learning challenges. It is becoming increasingly difficult to build high-performance
solutions of neural networks with deep learning as the dimension of the nodes grows in response to
the needs of actual applications. In this study, we propose a deep neural accelerator unit (DLAU), a
scalable accelerating architecture for large-scale neural networks using deep learning that uses field-
programmable gate matrix (FPGA) as the computer chip prototype, to achieve both performance
gains and reduced power consumption. The DLAU accelerator uses tiling approaches to investigate
locality in deep learning applications and makes use of three pipelined computing units to increase
throughput. The DLAU accelerated is capable of speeds comparable to those of Intel Core2 CPUs,
as shown by experiments conducted on a cutting-edge Xilinx FPGA board.
Index Terms-Deep learning, field-programmable gate array (FPGA), hardware accelerator, neural
network.

INTRODUCTION
In recent years, machine learning has seen widespread implementation in several software and cloud
services. It has been known since 2006 that a certain class of neural networks made up of algorithms
outperforms previous state-of-the-art techniques in a variety of machine learning tasks. The artificial
neural networks known as CNNs, or Convolutional Ne have been shown to be highly effective in
addressing difficult machine learning problems like image and audio recognition. The acronym DL
describes the computing algorithm used. Deep Learning, or multilayer neural networks, are
notoriously resource-hungry due to their many hidden layers. Google's cat-recognition system (1 a
billion dollars neuronal connections) and Baidu's Brain system (100 billion dollars neuronal
connections) are just two examples of how the size of Deep Learning wireless networks is growing
in tandem with the accuracy requirements and intricate nature of practical applications. As a result,
optimizing the performance of large-scale neural networks trained using deep learning has emerged
as a central area of study.
FPGA (Field-Programmable Logic Array) is a key way to speed up deep learning algorithms because
of its high performance along with low power consumption. Due to the exponential growth of data,
data centers consume a lot of energy. In example, it is expected that by 2020 [4] the United States'
data centers' yearly power usage would have increased to almost 140 billion kilowatt-hours.
Therefore, particularly for large-scale deep learned neural network models, it presents major
problems when creating high-performance deep learning systems with minimal electrical
consumption. Field programmable gate arrays, application specialized integrated circuits, and

92
N.Vimala 2023 Advanced Engineering Science

graphics processing units are now state-of-the-art methods for speeding up deep learning procedures.
Hardware accelerators, including FPGAs and ASICs, can deliver at least reasonable performance
with reduced power consumption compared to GPU acceleration. Building large and complicated
DNNs on hardware accelerators is difficult because of the hardware's restricted processing power,
memory, and input/output (I/O) bandwidths. The ASIC design process is more time-consuming, and
the resulting lack of adaptability leaves designers wanting. DianNao, introduced by Chen et al. [6]
as a ubiquitous algorithmic learning silicon accelerator, is widely credited with launching the deep
learning processor industry. This shifts the paradigm for neural network-focused hardware
accelerators for machine learning. However, DianNao cannot adjust to the needs of various
applications since it is not built on programmable hardware like FPGA. Current work on FPGA
acceleration includes a technique to speed up the Boltzmann machine with limited capabilities
(RBM) that was built by Ly and Chow [5]. The RBM method was optimized by designing special
hardware processing cores. The RBM was also the subject of an FPGA-based accelerator built by
Kim et al. [7]. Each RBM processing module handles a tiny subset of the network's nodes, allowing
for parallel processing. Similar FPGA-based artificial intelligence acceleration are shown in other
research [9]. The FPGA-based accelerator reported by Yu et al. [8] is limited in its ability to adapt to
dynamic changes in network size and topology. In conclusion, these research efforts center on
effectively executing a single deep learning algorithm, but the question of how to scale up the size
of neural networks while maintaining their adaptability remains unanswered.
To address these issues, we introduce a deep learning accelerate unit (DLAU) that can be scaled up
to accelerate the computationally intensive kernels of deep learned algorithms. To design large-scale
neural networks, we make use of tiling methods, FIFO buffers, and streams to reduce memory
transfer operations and maximize CPU utilisation. The following are some of the ways in which this
method differs from that of the aforementioned literatures.

1) We use tiling approaches to split the large-scale input data in order to investigate the localization
of the learning application. To optimize for performance against cost, the DLAU framework may be
set up to run on a variety of tile sizes. As a result, the FPGA-based accelerator may be easily adjusted
to work with a wide variety of machine learning programs.
Activation function accelerate unit (AFAU), part sum collection unit (PSAU), and tiled matrices
multiplication unit (TMMU) are the three completely pipelined processing units that make up the
DLAU accelerator. These building blocks may be used to construct a wide variety of more complex
networks, including conventional and deep neural networks. Because of this, FPGA-based
accelerators are more scalable than ASIC-based ones.

Motivation for FPGA implementation.

There have been several attempts to improve the speed of neural networks by implementing them in
hardware. From analog computers to very large scale integration (VLSI) systems, many have tried
and failed to create widely used hardware. Common issues with such systems include low resolution,
small networks, and a cumbersome or nonexistent user interface.
The Basics of Neural Networks

93
N.Vimala 2023 Advanced Engineering Science

Layering is a common method for structuring neural networks. 'Nodes' are the building blocks of a
layer; each one contains a 'activation function' and is connected to others via "links." The 'input layer'
receives patterns from the user and relays them to the 'hidden layers' via a set of weighted
'connections,' where the real processing occurs. In the diagram below, the inferred solution is
produced from the 'output layer' to which the hidden layers join.

Fig.1 Neural Network.

Applications of neural networks.

To get the most out of a neural network, the algorithm you are trying to simulate should have a high
tolerance for error, as neural systems are general approximators. Hence, artificial neural networks
shouldn't be used to keep financial records straight. In cases when the data volume, number of
variables, or variety are exceptionally high, however, these methods excel at: capturing correlations,
or detecting regularities within a group of patterns.
the connections between the variables are not clearly understood, or cannot be fully described using
traditional methods.

LITERATURE SURVEY
DjiNN and Tonic: Distributed Neural Networks as a Service and Its Effect on Future Large-
Scale Computers in Warehouses.
Abstract: As voice-activated assistants like Apple's Siri, Google Now, Microsoft's Cortana, and
Amazon's Echo gain popularity, online service providers are turning to massive neural networks, or
deep neural networks (DNN) to tackle difficult machine learning tasks like image processing,
detection of speech, and NLP. There are many unanswered problems regarding the optimal
configuration of today's warehouse-scale computers (WSCs) and the construction of a server
architecture tailored specifically for DNN applications. Tonic Suite is a set of seven end-to-end apps
for processing images, voice, and language, and DjiNN is an open architecture for DNN as a service
in WSCs that we introduce in this work. We utilize DjiNN to plan a high-throughput DNN solution
based on huge GPU server architectures, and we shed light on the wide range of application-specific
94
N.Vimala 2023 Advanced Engineering Science

features. We examine numerous design issues for future WSC designs after analyzing the
performance, bandwidth, and power features of DjiNN / Tonic Suite. We compare a WSC with an
aggregated GPU pool to one made up of homogenous integrated GPU servers, and we calculate the
resulting difference in total cost of ownership. We increase DNN speed by over 120x on a Gtx K40
GPU for all but a single app (Facial Recognition, by 40x). For three of the seven applications, we
see near-linear scalability on a GPU server built from eight NVIDIA K40s (an gain of roughly a
thousand times in throughput). We also discover that, depending on the workload's makeup, the total
cost of acquisition for GPU-enabled WSCs is 4-20 lower than for CPU-only implementations.
Johann Hauswald penned the work. Cang Yiping Laurenzano, Michael A. Trevor Mudge Quan
Cheng Li Dreslinski, Ronald G. University of Michigan, Ann Arbor, MI, USA Brian Mars Lingjia
Tan Clarity Lab
Deep convolutional neural network acceleration by field-programmable gate array
optimization.

abstract: Since convolutional neural networks (CNNs) may achieve great accuracy by modeling the
actions of biological optic neurons, they have found widespread usage in the field of image
recognition. Rapid development of new applications based on artificial intelligence algorithms has
helped advance both theory and practice in this area. With its excellent performance, ease of
reconfigurability, and rapid creation round, etc., the FPGA technology has inspired a number of
suggested accelerators for deep CNN. The accelerator design area has not been fully utilized, despite
the fact that existing FPGA accelerators have proven to offer superior performance than generic
processors. The memory bandwidth offered by an FPGA platform may not be adequate for the
calculation throughput, which is a major issue. Therefore, current methods are not optimal due to
inefficient use of logic resources and memory bandwidth. Meanwhile, this issue is made worse by
the growing complexity the scalability of systems that employ deep learning. To address this issue,
we provide a method of analytical design based on the roofline model. We use several optimization
strategies, such as loop tiles and transformation, to do a quantitative analysis of the compute
throughput and needed memory bandwidth of each variation of a CNN architecture. The rooine
model may then be used to determine which option achieves the desired results while using the fewest
available FPGA resources. We use the creation of a CNN accelerator on a VC707 FPGA board as a
case study, contrasting our method with others. Our solution is substantially more efficient than prior
methods, with a peak performance of 61.62 GFLOPS at a working frequency of 100MHz.
Jason Cong, Peng Li, and Chen Zhang wrote the paper.

A high-performance FPGA architecture for restricted Boltzmann machines.

Abstract: Although neural networks have been widely used and have shown impressive
results in academic settings, they have seen very few commercial or industrial applications. The fact
the neural systems are frequently implemented as program running on processors that are general-
purpose is a major factor in their slow rate of adoption. Due to the fact that most software algorithms
for implementing a neural network are O(n2) issues, neural networks fall short in terms of the
scalability and efficiency needed in real-world applications. In this research, we look at how FPGAs

95
N.Vimala 2023 Advanced Engineering Science

may be utilized to improve the scalability and performance of neural network implementations by
using the networks' intrinsic parallelism. One common neural network architecture that lends itself
well to hardware implementations is the Restricted Danube machine, which we shall explore in
detail. With the suggested general-purpose hardware architecture, the O(n22) issue can be solved
with just O(n) resources, making it feasible to implement. A 100MHz Xilinx Virtex II-Pro XC2VP70
FPGA is used to test the framework. With these means, a Restricted Boltzmann machine with 128
by 128 nodes may operate at a pace of 1.02 billion connection-updates per second, which is a 35-
fold increase in speed over an optimized C code running on a 2.8 GHz Intel CPU.
Authors: D. L. Ly and P. Chow.

DEEP NERUAL NETWORKS

"Stacked neural networks," or multi-layered networks, are what we mean when we talk about "deep
learning." Nodes are what make up the layers. A node is only a location for computing, and it is
based on the analogy of a neurons in the mind of a person, which "fires" when it receives enough
stimulation. For the purpose of learning a task, a node combines data input wit a set of coefficients,
and or weights, that either boost or reduce the relevance of that input. (Which input is best for error-
free data classification, for instance?) The activation function of a node decides whether or not the
signal from its input-weight product should be passed on to subsequent layers of the network to
influence the final outcome, such as a classification.

The number of node layers through which input flows in a multistep process of pattern recognition
is what sets deep-learning networks apart from the more conventional single-hidden-layer neural
networks.

Figure 2. The schematic diagram of DNNs for Mnist

Fig.2 illustrates a deep neural network for handwritten digits recognition which composed of Mnist
is an array of handwritten digits, and its structure is as follows: one input layer, numerous hidden
layers, and one output layer. In this study, DNNs serve as an illustration. Both the prediction process
and the training process are computational modes in DNNs. The weight coefficients obtained during
96
N.Vimala 2023 Advanced Engineering Science

the training phase are used to calculate the output for each input during the prediction step, which is
a feed forward calculation. There are two phases to the training process: pre-training, in which the
connection weights connecting the units in neighboring layers are tuned locally using the training
datasets [4]; and global training, in which the connection weights are tuned worldwide using the
Back Spreading algorithm (BP algorithm). We use the prediction method instead of the training
procedure due to technological and commercial constraints.

3.2 RESTRICTED BOLTZMANN MACHINE THEORY.

This section provides a high-level overview of RBM mechanics, including some of the basic
vocabulary, mathematical foundation, and technique involved. A RBM is a type of stochastic,
generative neural network. Given a set of patterns, a neural network is going to able to construct an
internal model that can recognize fresh data in the same distribution, making it useful for modeling
the statistical habits of a given collection of data.

Figure 3: A schematic diagram of a Restricted Boltzmann machine with labelled components.

Additionally, the network may produce data from that distribution thanks to the underlying model's
generative characteristic. Because it employs a probabilistic strategy, the RBM may be thought of as
stochastic. The RBM uses statistical procedures to ascertain the probability distribution of statistical
attributes of a dataset. RBMs are distinct from conventional neural networks in both their design and
their applications due to their generative and stochastic qualities.
The RBM may be trained like any other neural network. The settings can be adjusted repeatedly in
accordance with a particular data set using the learning rules rather than being set manually. As a
result, the RBM is applied to a collection of data vectors that serve as the training set. The RBM uses
iterative processing of the training data to get the desired result. The RBM has reached an adequate
level of training at this point. The test data is a fresh collection of data vectors that haven't been used
in training yet, and they're used to check the model's behavior.
The RBM is a synthesis of neural network theory and statistical mechanics, therefore its language
reflects the origins of these two disciplines. Nodes are commonly used to describe processing
components in neural networks. The nodes can be in one of two possible states—on or off. There are
two sets of nodes in an RBM—the external layer and the internal layer. The data in the network's
hidden layer is represented internally, while the data in the visible layer is used for input/output.
Nodes on the same layer are not connected to one another, but all other nodes are. The weights
assigned to these associations are what the RBM uses to train itself.

97
N.Vimala 2023 Advanced Engineering Science

The ith and jth nodes' binary states in the visible and hidden layers, respectively, are denoted by vi
and hj, whereas wi,j represents the connection weight between the ith and jth nodes. In Fig. 3 we
have a simplified diagram that encapsulates all of the relevant vocabulary and nomenclature.

PROPOSED SYSTEM
To address these issues, we introduce DLAU, a scalable deep learning accelerator unit designed to
expedite the algorithm's kernel computations. In particular, we design large-scale neural networks
by making use of tiling methods, FIFO buffers, and pipelines to reduce the number of memory
transfer operations and maximize the reusability of the computational nodes. The following
contributions set this method apart from its predecessors.
First, we use tiling methods to divide up the massive input data so that we may investigate the
application's locality using deep learning. To take advantage of the tradeoffs between speedup and
hardware costs, the DLAU architecture may be set up to handle multiple amounts of tile data.
Therefore, the accelerator based on FPGAs is more flexible in its ability to adapt to various machine
learning tasks.
2) The DLAU accelerator has a tiling matrix multiplication unit (TMMU), a part sum accumulation
unit (PSAU), and an activation function acceleration unit (AFAU), all of which work in parallel.
Various System

Fig. 6. DLAU accelerator architecture. topologies

such as CNN, DNN, or even emerging neural networks can be composed from these basic modules.
Consequently, the scalability of FPGA-based accelerator is higher than ASIC-based accelerator.

TILE TECHNIQUES AND HOT SPOT PROFILING.

In order to train each layer of a deep network quickly and effectively, RBMs have become
popular. There is often just one input layer, numerous hidden layers, and then a single classifier layer
in a DNN. There is a weighted connection between all the nodes in neighbouring layers. With the
existing network configurations, prediction involves feedforward computation from input neurons to
output neurons. In the training process, the connection weights between units in neighboring layers
are tuned locally during pretraining, and then globally during global training via the back propagation
(BP) technique.

98
N.Vimala 2023 Advanced Engineering Science

Iterative calculations with minimal conditional branch operations are common in large-scale
DNNs, making them a good candidate for hardware parallelization. In this study, we begin by
utilizing the profiler to investigate the hotspot. Matrix multiplication (MM), activation, and vector
operations account for a large fraction of the total execution time, as seen in Fig. 1. MM are crucial
to the success of the three most emblematic activities, which are feed forward, RBM, and BP. In
particular, it necessitates 99.1% of all BP activities, 100% of RBM operations, and 98.6% of all feed
forward operations. Only 1.40 percent, 1.48 percent, and 0.42 percent of the three operations are used
by the activation function. Profiling experiments show that MM accelerators may greatly boost
system speedup when properly designed and implemented.
However, in comparison to GPU and CPU optimization approaches, FPGA implementations
face a substantial difficulty due to the large memory bandwidth and compute resources required to
enable the parallel processing. In this study, we apply tile algorithms to split the enormous input data
set into tiled subsets and thereby attack the problem. Each custom hardware accelerator can store the
tiled portion of data in its own buffer before processing it. The accelerator design is utilized to support
the massive neural networks. Also, the hardware accelerators can be used in parallel with the data
access for each tiled subset.
In particular, the output neurons in one iteration become the input neurons in the following
iteration. At each iteration, the weights matrix is multiplied by each column of input neurons to
produce a new set of output neurons. Algorithm 1 shows how the input values are tiled and multiplied
by the matching

TABLE I PROFILING OF HOT SPOTS OF DNN

99
N.Vimala 2023 Advanced Engineering Science

weights. After each component's sum is determined, the total is tallied. We additionally tiled the
weight matrix to match to the neuron input/output sizes. This results in a huge reduction in hardware
requirements for the accelerator, as its price is now just proportional to the size of the tiles. The tiling
approach is effective because it permits the deployment of extensive networks while making use of
relatively few physical resources. FPGA technology has an additional benefit over GPU design
because of its pipelined hardware implementation, which utilizes huge parallel SIMD architectures
to boost overall speed and throughput. In this paper, we implement the specialized accelerator to
speed up the MM and activation functions, which, as shown in Table I of the profiling results, are
common but important computational parts during the prediction process and the training process in
deep learning algorithms.
DUAL ARCHITECTURE AND EXECUTION MODELL.
The DLAU accelerator, DDR3 memory controller, direct memory access module, and
integrated processor are all shown in Fig. 6. The embedded processor talks to DLAU through JTAG-
UART and provides a programming interface for the users. The DLAU accelerator is activated, input
data and weight matrix are stored in internal BRAM blocks, and the results are delivered to the user
after execution. The DLAU is designed as a modular component that may be configured for a wide
variety of uses. The DLAU is made up of three processing units set up in a sequential fashion:
1) TMMU;
2) PSAU; and
3) AFAU.

When executed, DLAU retrieves tiled data from memory through DMA, processes it using
each of the three processors in turn, and then stores the processed data back in memory. The
following are distinguishing characteristics of the DLAU accelerator design.

100
N.Vimala 2023 Advanced Engineering Science

Fig. 7. TMMU schematic.

To receive or transmit data in FIFO order, each processing unit in DLAU is equipped with
an input buffer and an output buffer. These buffers are used to avoid information loss due to varying
throughput between CPUs.
Different machine learning tasks may call for different-sized neural networks, which may be
accommodated via tiling techniques. To make the accelerator suitable for networks of varying sizes,
we tile the data into manageable chunks that can be cached in-chip. Therefore, the accelerator based
on FPGAs is more flexible in its ability to adapt to various machine learning tasks.
As a result of our pipeline accelerator, the TMMU, PSAU, and AFAU are able to perform
computations in a streaming-like manner by exchanging data with one another via a stream-like data
passing mechanism (for example, AXI-Stream for demonstration). The Total Weights and Tiled
Nodes Data is read by the TMMU, which is the primary computing unit, and the calculations are
performed. The intermediate part sum results are then sent to the PSAU. PSAU compiles and totals
up the sums of the individual parts. Once the data has been accumulated, it will be sent to AFAU.
Piecewise linear interpolation techniques are used by AFAU to carry out the activation function. The
remaining text of this article will focus on the third, second, and first CPUs and how they were each
implemented.

TMMU Architecture
Multiplication and adding up are TMMU's responsibilities. Part sums are calculated using
TMMU, which is tailored to take use of the weights' data localisation. The component sums are sent
to PSAU through an output FIFO buffer, while the input FIFO buffer gets data via DMA. In Fig.7,
we show the TMMU diagram in more detail by specifying a tile size of 32. The weight matrix is read
by TMMU from the input buffer and then distributed across the BRAMs in a 32 by i fashion, where
n is the total number of BRAMs and i is the row number of the weight matrix. Then, TMMU will
start storing the tiled node data in a buffer. When TMMU initially begins execution, it loads the tiled
32 values into register Reg_a. Every cycle, TMMU receives the next node from the input buffer and
writes it to the registers named Reg_b in parallel with the calculation. As a result, Reg_a and Reg_b
are interchangeable registers.
We employ a pipelined binary adder tree structure to execute the computation quickly and
101
N.Vimala 2023 Advanced Engineering Science

efficiently. Figure 7 shows that BRAMs and registers are used to store information about nodes and
their associated weights. Coarse-grained accelerators are used in a time-sharing pipeline. As a result,
the TMMU unit is able to generate a part sum result at the end of each clock cycle using this approach.

Fig. 8. PSAU schematic.

PSAU Architecture
The task of accumulation falls on PSAU's shoulders. The PSAU shown in Fig. 8 is the hardware used
to store the total of the TMMU's output parts. If the total of the parts is the final value, PSAU will
store it in the output buffer before sending the pipelined results to AFAU. The throughput of PSAU
accumulation is equivalent to the rate at which the part sum is generated in TMMU, as PSAU may
accumulate one part sum each clock cycle.
AFAU Architecture
At last, AFAU uses piecewise linear interpolation to realize the activation function (y = ai x
+ bi, x [x1, xi+1]). When the difference between xi and xi+1 doesn't matter much, this approach has
been used extensively to design activation functions. As an example of a sigmoid function equation,
consider (1). When x is more than 8 and less than -8, the answers are sufficiently close to the limits
of 1 and 0, respectively. Different functions are set up for the ranges 8 x 0 and 0 x 8. In total, we cut
the sigmoid into four distinct chunks.

To keep the throughput with other processing units constant, AFAU, like PSAU, employs input and
output buffers. To be more specific, we keep a and b's values in two different BRAMs. Every time
the clock ticks over, the sigmoid function used in the AFAU calculation gets put through its paces.
Therefore, the DLAU accelerator architecture's maximum throughput is guaranteed by completely
pipelined operation across all three processing units.

102
N.Vimala 2023 Advanced Engineering Science

RESULTS

103
N.Vimala 2023 Advanced Engineering Science

CONCLUSION

In this study, we introduce DLAU, an FPGA-based deep learning accelerator designed for scalability
and adaptability. For large-scale neural networks, the DLAU's three pipelined processing units come
in handy. In order to compute frequently by time-sharing the arithmetic logic, DLAU utilizes tile
methods to split the input node data into smaller groupings. In experiments conducted on a prototype
Xilinx FPGA, it was shown that DLAU could yield a speedup of 36.1% with a little increase in
hardware cost and minimal increase in power consumption.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444,
2015.

[2] [2] J. Hauswald et al., “DjiNN and Tonic: DNN as a service and its implications for future
warehouse scale computers,” in Proc. ISCA, Portland, OR, USA, 2015, pp. 27–40.

[3] C. Zhang et al., “Optimizing FPGA-based accelerator design for deep convolutional neural
networks,” in Proc. FPGA, Monterey, CA, USA, 2015, pp. 161–170.
[4] P. Thibodeau. Data Centers are the New Polluters. Accessed on Apr. 4, 2016. [Online]. Available:
http://www.computerworld.com/ article/2598562/data-center/data-centers-are-the-new-
polluters.html.
[5] D. L. Ly and P. Chow, “A high-performance FPGA architecture for restricted Boltzmann
machines,” in Proc. FPGA, Monterey, CA, USA, 2009, pp. 73–82.
[6] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-
learning,” in Proc. ASPLOS, Salt Lae City, UT, USA, 2014, pp. 269–284.
[7] S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun, “A highly scalable restricted

104
N.Vimala 2023 Advanced Engineering Science

Boltzmann machine FPGA implementation,” in Proc. FPL, Prague, Czech Republic, 2009, pp. 367–
372.
[8] Q. Yu, C. Wang, X. Ma, X. Li, and X. Zhou, “A deep learning prediction process accelerator
based FPGA,” in Proc. CCGRID, Shenzhen, China, 2015, pp. 1159–1162.
[9] J. Qiu et al., “Going deeper with embedded FPGA platform for convolutional neural network,”
in Proc. FPGA, Monterey, CA, USA, 2016, pp. 26–35.

[10] Capacity trust assessment for multi-hop routing in wireless sensor networks,Sowmya
Gali1, , Alekya Himabindu B.1, Nagamani V.1, Jayamangala S.1, Munawwar S.1, Mallikarjuna
Rao Y, E3S Web of Conferences 391, 01181 (2023).
[11] Multiple Sensor Data Fusion Based Top-Down Embedded System for Automatic Plant Pot
Watering,B Saraswati, MV Subramanyam, Journal of Algebraic Statistics 13 (1), 884-891,2022.
[12] “Despite Non-Uniform Motion Blur, Illumination, And Noise, Face Recognition”,International
Journal Of Progressive Research In Engineering Management And Science (IJPREMS) vol. 02, issue
08, august 2022, pp : 25-29 e-issn : 2583-1062 impact factor : 2.265.
[13] “Vlsi Implementation Of Ternary Adder And Multiplier Using Tanner Tool”, Journal of
Pharmaceutical Negative Results , Volume 13 ,Special Issue 5 ,2022.
[14] Novel Design of Ternary Arithmetic Circuits,BA Himabindu, DS Kumar, G Mahendra, DM
Srikanth, KD ReddyIJIRE Volume 3, Issue 4 (July-August 2022), PP: 172-183.,2022.
[15] CARRY SELECT ADDER DESIGN USING D-LATCH WITH LESS DELAY AND MORE
POWER EFFICIENT,OM Chandrika, BAH Bindu,IJMTES,Vol.3,Issue.7,2016.
[16] Automatic Gas Alerting System,OM Chandrika, BAH Bindu,Imperial Journal of
Interdisciplinary Research (IJIR) 2 (6), 2454-1362,2016
[17]Watermarking of digital images with iris based biometric data using wavelet and SVD,AH
Bindu, V Saraswati,Int. J. Eng. Dev. Res 4 (1), 726-731,2016
[18] Design of High Performance Pipelined Data Encryption Standard (DES) Using Xilinx Virtex-6
FPGA Technology,SB Sridevi, BA Himabindu,The International Journal of Science and
Technoledge 2 (2), 53,2014.
[19] CH.Nagaraju, Dr.Anil Kumar Sharma, Dr.M.V.Subramanyam,” A Review on BER
Performance Analysis and PAPR Mitigation in MIMO OFDM Systems”, International Journal of
Engineering Technology and Computer Research, Vol.3,No.3,PP.237-238, June, 2015.
[20] V.Jyothi, Dr.M. V. Subramanyam “ QIF-NDMRP-QOS aware Interference free node-Disjoijnt
multipath routing protocol using source-Destination edge based path selection scheme.
’Webology’.pp-773-786 , ISSN:1735- 188X, Vol.19, No:2,2022.(scopus)
[21]. Y.Mallikarjuna Rao, M.V.Subramanyam and K. Satyaprasad, “An efficient resource
management algorithms for mobility management in wireless mesh networks”, IEEE Conference on
energy, communication, data analytics and soft computing, SKR Engineering College, Chennai,
2017.
[22].Mallikarjuna Rao, M.V.Subramanyam and K. Satyaprasad, “Performance analysis of energy
aware QoS enabled routing protocol for wireless mesh networks”, International journal of smart grid
and green communications
[23]. Y.Mallikarjuna Rao et.al “Design and Performance Analysis of Buffer Inserted On-Chip Global
105
N.Vimala 2023 Advanced Engineering Science

Nano Interconnects in VSDM Technologies”, Nanotechnology for Environmental Engineering, 11

May, 2022, Springer.

106

Individual Paper - Nina Luksha - ITEC 625 9080 - Updated
No ratings yet
Individual Paper - Nina Luksha - ITEC 625 9080 - Updated
11 pages
Hardware Design For Machine Learning
No ratings yet
Hardware Design For Machine Learning
22 pages
Estimation of Age and Gender Using Convolutional Neural Network
100% (1)
Estimation of Age and Gender Using Convolutional Neural Network
35 pages
New Dlau
No ratings yet
New Dlau
52 pages
FPGA - Based Accelerators of Deep LearningNetworks For Learning and Classification
100% (1)
FPGA - Based Accelerators of Deep LearningNetworks For Learning and Classification
37 pages
2022 Review of FPGA-Based Accelerators of Deep Convolutional Neural Networks
No ratings yet
2022 Review of FPGA-Based Accelerators of Deep Convolutional Neural Networks
7 pages
A Deep Learning Prediction Process Accelerator Based FPGA PDF
No ratings yet
A Deep Learning Prediction Process Accelerator Based FPGA PDF
4 pages
A General Neural Network Hardware Architecture On FPGA
No ratings yet
A General Neural Network Hardware Architecture On FPGA
6 pages
Applsci 12 10771 v2
No ratings yet
Applsci 12 10771 v2
44 pages
Design and Implementation of Deep Neural Network Hardware Chip and Its Performance Analysis
No ratings yet
Design and Implementation of Deep Neural Network Hardware Chip and Its Performance Analysis
10 pages
May 2025 - Top 10 Read Articles in Artificial Intelligence and Applications (IJAIA)
No ratings yet
May 2025 - Top 10 Read Articles in Artificial Intelligence and Applications (IJAIA)
36 pages
Unit I
No ratings yet
Unit I
10 pages
Capra 2020
No ratings yet
Capra 2020
48 pages
Benchmarking Contemporary Deep Learning Hardware and Frameworks A Survey of Qualitative Metrics
No ratings yet
Benchmarking Contemporary Deep Learning Hardware and Frameworks A Survey of Qualitative Metrics
8 pages
High Performance FPGA Based CNN Accelerator
No ratings yet
High Performance FPGA Based CNN Accelerator
4 pages
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
No ratings yet
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
42 pages
CH 8
No ratings yet
CH 8
42 pages
31.DL - Post-Proofreading 1-5
No ratings yet
31.DL - Post-Proofreading 1-5
5 pages
Article
No ratings yet
Article
10 pages
Deep L Earning
No ratings yet
Deep L Earning
7 pages
Unit 4 Notes New
No ratings yet
Unit 4 Notes New
49 pages
Deep Learning
100% (3)
Deep Learning
32 pages
Deep Learning On Mobile Devices-A Review
No ratings yet
Deep Learning On Mobile Devices-A Review
15 pages
AI Discussion Intro
No ratings yet
AI Discussion Intro
8 pages
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
No ratings yet
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
41 pages
Artificial Intelligence - Benefits and Applications of Deep Learning
No ratings yet
Artificial Intelligence - Benefits and Applications of Deep Learning
7 pages
AI Module V
No ratings yet
AI Module V
40 pages
Application and Methods of Deep Learning in IoT
No ratings yet
Application and Methods of Deep Learning in IoT
7 pages
Jan 19ijamte - CW
No ratings yet
Jan 19ijamte - CW
9 pages
The Application of Deep Learning in Autonomous Driving
No ratings yet
The Application of Deep Learning in Autonomous Driving
5 pages
Deep Learning Lab Manual
No ratings yet
Deep Learning Lab Manual
65 pages
ML Unit 4
No ratings yet
ML Unit 4
16 pages
Deep Learning
No ratings yet
Deep Learning
22 pages
JNTUK R20 B.Tech CSE 4-1 Deep Learning Techniques Unit 1 Notes
No ratings yet
JNTUK R20 B.Tech CSE 4-1 Deep Learning Techniques Unit 1 Notes
15 pages
Efficient Deep Learning Infrastructures For Embedded Computing Systems: A Comprehensive Survey and Future Envision
No ratings yet
Efficient Deep Learning Infrastructures For Embedded Computing Systems: A Comprehensive Survey and Future Envision
101 pages
Futureinternet 12 00113 v2
No ratings yet
Futureinternet 12 00113 v2
22 pages
Deep Learning A Review
No ratings yet
Deep Learning A Review
11 pages
MythicWhitepaper 2019oct31
No ratings yet
MythicWhitepaper 2019oct31
9 pages
31.DL - Post-Proofreading 1-4
No ratings yet
31.DL - Post-Proofreading 1-4
4 pages
Deep Learning Applications and Image Processing
No ratings yet
Deep Learning Applications and Image Processing
5 pages
Embedded Deep Learning Accelerators - A Survey On Recent Advances
No ratings yet
Embedded Deep Learning Accelerators - A Survey On Recent Advances
19 pages
Advanced Intelligent Systems - 2024 - Song - Hardware For Deep Learning Acceleration
No ratings yet
Advanced Intelligent Systems - 2024 - Song - Hardware For Deep Learning Acceleration
20 pages
Accelerating Binarized Neural Networks Comparison of FPGA CPU GPU and ASIC
No ratings yet
Accelerating Binarized Neural Networks Comparison of FPGA CPU GPU and ASIC
8 pages
3rd Unit DL Final Class Notes
No ratings yet
3rd Unit DL Final Class Notes
78 pages
Hardware For Deep Learning Acceleration
No ratings yet
Hardware For Deep Learning Acceleration
20 pages
What Is Deep Learning Basics
No ratings yet
What Is Deep Learning Basics
11 pages
Make 04 00004 v3
No ratings yet
Make 04 00004 v3
37 pages
DaDianNao A Machine-Learning Supercomputer
No ratings yet
DaDianNao A Machine-Learning Supercomputer
14 pages
Sway 020 A
No ratings yet
Sway 020 A
7 pages
UNIT-1 Foundations of Deep Learning
100% (1)
UNIT-1 Foundations of Deep Learning
51 pages
Mipsology Aws f1
No ratings yet
Mipsology Aws f1
10 pages
Deep Learning - Fundamentals, Theory and Applications 2019 PDF
100% (10)
Deep Learning - Fundamentals, Theory and Applications 2019 PDF
168 pages
Lesson 02 Introduction To Deep Learning
No ratings yet
Lesson 02 Introduction To Deep Learning
74 pages
Deep Learning
No ratings yet
Deep Learning
7 pages
Advancements and Applications of Deep Learning
No ratings yet
Advancements and Applications of Deep Learning
4 pages
2022 - A Survey On Deep Learning For Software Engineering
No ratings yet
2022 - A Survey On Deep Learning For Software Engineering
73 pages
The First Artificial Neuron
No ratings yet
The First Artificial Neuron
2 pages
Deep Learning
No ratings yet
Deep Learning
14 pages
Unit-3 Notes
No ratings yet
Unit-3 Notes
16 pages
Singa Tomm
No ratings yet
Singa Tomm
23 pages
Software Defined Networking (SDN) - a definitive guide
From Everand
Software Defined Networking (SDN) - a definitive guide
Rajesh Kumar Sundararajan
2/5 (2)
Android Developer Internship 10 Weeks
No ratings yet
Android Developer Internship 10 Weeks
13 pages
Android Developer Virtual Internship Week 1
No ratings yet
Android Developer Virtual Internship Week 1
2 pages
R164104D102019
No ratings yet
R164104D102019
4 pages
R164104D052022
No ratings yet
R164104D052022
1 page
Chap 7.1 Sequence Analysis Using FFN
No ratings yet
Chap 7.1 Sequence Analysis Using FFN
47 pages
1 s2.0 S1568494621001976 Main
No ratings yet
1 s2.0 S1568494621001976 Main
10 pages
Design Implementation and Evaluation of Online Eng
No ratings yet
Design Implementation and Evaluation of Online Eng
11 pages
Lecture 6 - Multi-Layer Feedforward Neural Networks Using Matlab Part 2
No ratings yet
Lecture 6 - Multi-Layer Feedforward Neural Networks Using Matlab Part 2
3 pages
Siddhartha: Course File Contents
No ratings yet
Siddhartha: Course File Contents
25 pages
Parallel Networks That Learn To Pronounce English Text: Terrence J. Sejnowski
No ratings yet
Parallel Networks That Learn To Pronounce English Text: Terrence J. Sejnowski
24 pages
Introduction To Neural Network
No ratings yet
Introduction To Neural Network
20 pages
Unit 5 2
No ratings yet
Unit 5 2
31 pages
Approach To The Synthesis of Neural Network Structure During Classification
No ratings yet
Approach To The Synthesis of Neural Network Structure During Classification
7 pages
Lecture8 NN 1
No ratings yet
Lecture8 NN 1
58 pages
LLM in A Flash: Efficient Large Language Model Inference With Limited Memory
No ratings yet
LLM in A Flash: Efficient Large Language Model Inference With Limited Memory
12 pages
Intelligent Control of Coke Oven
No ratings yet
Intelligent Control of Coke Oven
8 pages
Ann 1
No ratings yet
Ann 1
102 pages
Unit3 DL JNTUK
No ratings yet
Unit3 DL JNTUK
15 pages
Soft Computing MCQ
No ratings yet
Soft Computing MCQ
10 pages
Power System Transient Stability Analysis by Neural Networks
No ratings yet
Power System Transient Stability Analysis by Neural Networks
13 pages
Artificial Intelligence ME: Manufacturing 6324
No ratings yet
Artificial Intelligence ME: Manufacturing 6324
23 pages
Thesis - Anomaly Detection
No ratings yet
Thesis - Anomaly Detection
57 pages
Soft Computing
No ratings yet
Soft Computing
30 pages
Syllabus Structural Engg
No ratings yet
Syllabus Structural Engg
32 pages
Ajol File Journals - 483 - Articles - 216226 - Submission - Proof - 216226 5701 532718 1 10 20211018
No ratings yet
Ajol File Journals - 483 - Articles - 216226 - Submission - Proof - 216226 5701 532718 1 10 20211018
9 pages
Preliminaries: Biological Neuron To Artificial Neural Network
No ratings yet
Preliminaries: Biological Neuron To Artificial Neural Network
21 pages
High-Performance Extreme Learning Machines - A Complete Toolbox For Big Data Applications PDF
No ratings yet
High-Performance Extreme Learning Machines - A Complete Toolbox For Big Data Applications PDF
15 pages
Fault Identification System For Electric Power Transmission Lines Using Artificial Neural Networks
No ratings yet
Fault Identification System For Electric Power Transmission Lines Using Artificial Neural Networks
8 pages
Deep Learning
No ratings yet
Deep Learning
38 pages
CSM 422
No ratings yet
CSM 422
2 pages
Electrical and Electronics Engineering Vii Semester: Course Theory/Lab L T P C Code
No ratings yet
Electrical and Electronics Engineering Vii Semester: Course Theory/Lab L T P C Code
66 pages
Neural Network Assignment 1 by Gourav Meena
No ratings yet
Neural Network Assignment 1 by Gourav Meena
14 pages
Design, Simulation and Implementation of An Adaptive Controller On Base of Artificial Neural Networks For A Resonant DC-DC Converter
No ratings yet
Design, Simulation and Implementation of An Adaptive Controller On Base of Artificial Neural Networks For A Resonant DC-DC Converter
4 pages

Dlau: A Scalable Deep Learning Accelerator Unit On Fpga N.Vimala, B.Alekya Himabindu, Y.Mallikarjuna Rao, G.Sowmya, S.Girish Babu, M.Mahesh Kumar

Uploaded by

Dlau: A Scalable Deep Learning Accelerator Unit On Fpga N.Vimala, B.Alekya Himabindu, Y.Mallikarjuna Rao, G.Sowmya, S.Girish Babu, M.Mahesh Kumar

Uploaded by

ISSN: 2096-3246

Volume 55, Issue 02, September, 2023

DLAU: A SCALABLE DEEP LEARNING ACCELERATOR UNIT ON FPGA

1N.Vimala, 2B.Alekya Himabindu, 3Y.Mallikarjuna Rao, 4G.Sowmya, 5S.Girish Babu,

Motivation for FPGA implementation.

Fig.1 Neural Network.

Applications of neural networks.

A high-performance FPGA architecture for restricted Boltzmann machines.

DEEP NERUAL NETWORKS

Figure 2. The schematic diagram of DNNs for Mnist

3.2 RESTRICTED BOLTZMANN MACHINE THEORY.

Figure 3: A schematic diagram of a Restricted Boltzmann machine with labelled components.

Fig. 6. DLAU accelerator architecture. topologies

TILE TECHNIQUES AND HOT SPOT PROFILING.

TABLE I PROFILING OF HOT SPOTS OF DNN

Fig. 7. TMMU schematic.

Fig. 8. PSAU schematic.

Nano Interconnects in VSDM Technologies”, Nanotechnology for Environmental Engineering, 11

You might also like