
[Title-slide figure: embedded system with Mem, Sensor, μP, and Battery]

E0 294: Systems for Machine Learning

Lecture #5
28th January 2025
Previous Class
▪ Some more CNNs
▪ Techniques to reduce FLOPs
▪ Output Stationary (OS) dataflow

2
Today’s Agenda
▪ Various dataflows for DNN execution
– Output Stationary (an example)
– Weight Stationary
– Input Stationary
▪ Introduction to DNN accelerators

3
2-D Convolution – Output Stationary

4
2-D Convolution – Output Stationary (1)

5
2-D Convolution – Output Stationary (2)

6
2-D Convolution – Output Stationary (3)

7
2-D Convolution – Output Stationary (4)

8
2-D Convolution – Output Stationary (5)

9
2-D Convolution – Output Stationary (6)
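The 2-D figures on these slides are not reproduced in this transcript. As a stand-in, here is a minimal loop-nest sketch of output stationary 2-D convolution (single channel, stride 1; E, K, and the array sizes are illustrative, not from the slides): each output element stays resident in one PE's accumulator until all of its partial sums have been integrated, and is written out exactly once.

#define E 4   /* output size (illustrative) */
#define K 3   /* kernel size (illustrative) */

/* Output stationary 2-D convolution: O[y][x] stays in one PE's
 * accumulator until it is fully computed, then is written once. */
void conv2d_os(const int I[E + K - 1][E + K - 1],
               const int W[K][K], int O[E][E]) {
    for (int y = 0; y < E; y++) {
        for (int x = 0; x < E; x++) {
            int acc = 0;                            /* stationary psum */
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    acc += I[y + ky][x + kx] * W[ky][kx];
            O[y][x] = acc;                          /* single write-out */
        }
    }
}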

10
Different Types of Data Flow
▪ Output Stationary (OS)
▪ Weight Stationary (WS)
▪ Input Stationary (IS)

11
Weight Stationary (WS) Dataflow

▪ Broadcast activations and accumulate partial sums (psums) spatially across the PE array
▪ Weights are stored in the PE array
– Minimizes weight-read energy consumption
– Maximizes convolutional and filter reuse of weights
▪ Weights stationary
– Weights do not move (they stay in the PEs) until all computations that use them are finished
12
1-D Convolution – Weight Stationary

13
Weight Stationary – Reference Pattern
▪ Observations
– A single weight is reused many times (E, where E is the output size)
– Large sliding window of inputs (E)
– Fixed window of outputs (E)

14
Data Access for WS

[Figure: data access patterns compared – Output Stationary vs. Weight Stationary]

15
Relation to WS Architecture

▪ In a WS architecture, the weights stay in each PE – P0:W0, P1:W1, …
▪ Inputs are broadcast to all PEs: Pi: I[0], I[1], I[2], …
▪ During computation, each PE adds its product into the partial sum each cycle and sends it to the next PE – spatial accumulation (see the sketch below)
– O[0] = I[0]W[0] + I[1]W[1] + I[2]W[2] + I[3]W[3]
– O[6] = I[6]W[0] + I[7]W[1] + I[8]W[2] + I[9]W[3]
▪ P0 → P1 → P2 → P3 …
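As a concrete (purely behavioral) model of this, here is a minimal sketch of weight stationary 1-D convolution, matching the slide's example (K = 4 weights, stride 1); the inner loop over p models the spatial psum chain P0 → P1 → P2 → P3, and the function signature is illustrative.

/* Weight stationary 1-D convolution (stride 1). PE p holds W[p]
 * stationary; inputs are broadcast, and the psum travels spatially
 * across the PEs, e.g. O[0] = I[0]W[0] + I[1]W[1] + I[2]W[2] + I[3]W[3]. */
void conv1d_ws(int N, int K, const int I[], const int W[], int O[]) {
    for (int o = 0; o <= N - K; o++) {
        int psum = 0;
        for (int p = 0; p < K; p++)       /* psum hops P0 -> P1 -> ... */
            psum += I[o + p] * W[p];      /* PE p: stationary W[p], broadcast input */
        O[o] = psum;
    }
}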
16
2-D Convolution – Weight Stationary (1)

17
2-D Convolution – Weight Stationary (2)

18
2-D Convolution – Weight Stationary (3)

19
2-D Convolution – Weight Stationary (4)

20
2-D Convolution – Weight Stationary (5)

21
2-D Convolution – Weight Stationary (6)

22
WS Example: NVDLA (simplified)
▪ NVIDIA Deep Learning Accelerator (NVDLA)

23
WS Example: NVDLA (simplified)
▪ NVIDIA Deep Learning Accelerator (NVDLA)

24
Different Types of Data Flow
▪ Output Stationary (OS)
▪ Weight Stationary (WS)
▪ Input Stationary (IS)

25
Input Stationary (IS) Data Flow
▪ Minimize activation-read energy consumption
– Inputs stay in the PEs until they are no longer needed
– Maximize convolutional and feature-map reuse of activations
▪ Unicast weights and accumulate psums spatially across the PE array (see the sketch below)
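A minimal behavioral sketch of input stationary 1-D convolution, under the same illustrative assumptions as the WS sketch earlier (kernel size K, stride 1): each input stays put in its PE while every weight that needs it is unicast to it, and each product is accumulated into the psum of the output it feeds.

/* Input stationary 1-D convolution (stride 1). PE j holds I[j]
 * stationary; weights are unicast to it, and each product is
 * accumulated spatially into the psum of output O[j - k]. */
void conv1d_is(int N, int K, const int I[], const int W[], int O[]) {
    for (int o = 0; o <= N - K; o++) O[o] = 0;  /* clear psums */
    for (int j = 0; j < N; j++) {               /* I[j] stays in PE j */
        for (int k = 0; k < K; k++) {           /* unicast weights */
            int o = j - k;                      /* output this product feeds */
            if (o >= 0 && o <= N - K)
                O[o] += I[j] * W[k];
        }
    }
}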

26
IS Data Flow

27
Cost of Different Dataflows

28
Today’s Agenda
▪ Various dataflows for DNN execution
– Output Stationary (an example)
– Weight Stationary
– Input Stationary
▪ Introduction to DNN accelerators

29
Accelerator: Why/What?
▪ Drawbacks of CPUs
– Limited parallelism
▪ Drawbacks of GPUs
– Limited memory
▪ Accelerators
– Application specific
– Tailored only for DNNs
– Mostly inference
▪ Major performance metrics
– Inference latency, energy, area

30
DianNao (Chinese for "computer") [1]
▪ One of the earliest accelerators for DNN models
▪ Tiling to reduce memory usage
– Classifier layer
– Convolutional layer
– Pooling layer
▪ An architecture tailored for DNN inference

[1] Chen, Tianshi, et al. "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning." ACM SIGARCH Computer Architecture News 42.1 (2014): 269–284.

31
General Structure of a DNN
▪ Sequential layers
▪ Feature maps
▪ Convolutional (with a non-linearity after the output), pooling, and classifier layers
– Sometimes shared kernels, sometimes private kernels
– Pooling: average and max (a sketch follows below)
– The classifier is a single-/multi-layer perceptron
Quiz: what are sx and sy in the figure?
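Since pooling comes up in the quiz, here is a minimal sketch of 2-D max pooling with explicit window and stride parameters; all names are illustrative, and reading sx/sy as the horizontal/vertical strides is one plausible interpretation of the figure's labels, not a confirmed answer.

/* 2-D max pooling over an H x W input. Ky, Kx: window size;
 * sy, sx: vertical/horizontal strides (illustrative names; reading
 * sx/sy as strides is an assumption about the slide's figure). */
void max_pool(int H, int W, const float in[H][W],
              int Ky, int Kx, int sy, int sx, float *out) {
    int oh = (H - Ky) / sy + 1;          /* output height */
    int ow = (W - Kx) / sx + 1;          /* output width  */
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++) {
            float m = in[y * sy][x * sx];
            for (int ky = 0; ky < Ky; ky++)
                for (int kx = 0; kx < Kx; kx++) {
                    float v = in[y * sy + ky][x * sx + kx];
                    if (v > m) m = v;    /* max; average would sum and divide */
                }
            out[y * ow + x] = m;
        }
}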

32
Classifier Layer
▪ Notations
• Tin = number of input neurons
• a[j] = activation (value) at the jth input neuron
• Tout = number of output neurons
• w[i][j] = weight from the jth input neuron to the ith output neuron
• s[i] = sum at the ith output neuron
[Figure: fully connected layer with Tin = 8 inputs and Tout = 2 outputs]
▪ Tin = 8, Tout = 2 in this example
▪ Implementation
for (i = 0; i < Tout; i++) {
    for (j = 0; j < Tin; j++) {
        s[i] += a[j] * w[i][j];
    }
}
▪ #memory accesses? (counted on the next slide)
33
Memory accesses in classifier layer
▪ Notations
• Tin = number of input neurons
• a[j] = activation (value) at the jth input neuron
• Tout = number of output neurons
• w[i][j] = weight from the jth input neuron to the ith output neuron
• s[i] = sum at the ith output neuron
for (i = 0; i < Tout; i++) {
    for (j = 0; j < Tin; j++) {
        s[i] += a[j] * w[i][j];
    }
}
▪ #memory accesses?
– Activations: Tin × Tout
– Weights: Tin × Tout
– Sums: Tout
– Total: Tin × Tout + Tin × Tout + Tout = 2 × Tin × Tout + Tout
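As a quick check with the slide's example sizes (Tin = 8, Tout = 2): 8 × 2 = 16 activation reads, 16 weight reads, and 2 sum accesses, i.e., 34 memory accesses in total.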
▪ Drawback
– The number of input neurons (Tin) can be very large (10 up to 1L, i.e., ~100,000)
– The data then cannot fit in the L1 cache
34
How to Reduce the Memory Accesses
▪ Rather, how do we make use of L1?
▪ Tiling
– Divide the input neurons into tiles
– Store one tile at a time in L1
– Compute the contribution from each tile separately
– Accumulate the outputs
[Figure: the same fully connected layer, Tin = 8, Tout = 2]
▪ Trade-off (illustrated in the sketch below)
– The number of times a[j] needs to be fetched from main memory is lower (+)
– The number of times s[i] needs to be fetched is higher (-)
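A minimal sketch of this tiling applied to the classifier loop, assuming an illustrative tile size Tt that fits in L1 and divides Tin (neither parameter is from the slides); note how the trade-off above shows up directly in the loop structure.

/* Tiled classifier layer. The input neurons are walked in tiles of
 * Tt so each tile of a[] stays resident in L1. Assumes s[] is
 * zero-initialized and Tin is a multiple of Tt (illustrative). */
void classifier_tiled(int Tin, int Tout, int Tt,
                      const float *a, const float *w, float *s) {
    for (int jj = 0; jj < Tin; jj += Tt)         /* one input tile */
        for (int i = 0; i < Tout; i++)           /* s[i] re-fetched per tile (-) */
            for (int j = jj; j < jj + Tt; j++)
                s[i] += a[j] * w[i * Tin + j];   /* a[j] served from L1 (+) */
}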

35
Tiling for Convolutional and Pooling Layers
▪ Tiling for convolutional layers
– Input feature maps
– Output feature maps
▪ Pooling layers
– No kernels (why?)
– Tiling does not improve the BW requirement
drastically
▪ Please go through the papers and come
back with any question(s)
▪ Expect similar question(s) in the quiz
Is this a good figure?

36
A Naïve Hardware Implementation
▪ Fully lay out the neurons and the synapses on the silicon
– Neurons: logic circuits performing the computation
– Synapses: latches/RAMs storing the weights
– Non-linear functions are implemented as piecewise linear functions
[Figure: the same fully connected layer, Tin = 8, Tout = 2]
▪ Advantage
– High speed: the distance travelled by the data is short
▪ Disadvantage
– Area increases drastically with the number of neurons

37
Performance of the Naïve Implementation
▪ Area and energy suffer a lot as the number of input and output neurons increases

38
Accelerator Architecture in DianNao
▪ Storage
– NBin: To store input activations
– SB: To store synapses
– NBout: To store outputs
▪ Neural functional unit (NFU)
– NFU-1: multipliers
– NFU-2: adder trees
– NFU-3: non-linear unit
▪ Control instructions

39
Storage (1)
▪ Split buffers: three separate buffers instead of a single cache
– Reduced conflicts
– Reduced access time
▪ Scratchpad memory
▪ Point to note
– The different widths of NBin/NBout and SB allow data of different widths to be read
for (i = 0; i < To; i++) {
    for (j = 0; j < Ti; j++) {
        s[i] += a[j] * w[i][j];
    }
}
40
Storage (2)
▪ NBout is a circular buffer
– Partial sums are pushed into NBout instead of being sent back to main memory and fetched again
– An output is read out of NBout only after all the input neurons have been integrated into its partial sum
– Useful while tiling (see the sketch below)

for (i = 0; i < To; i++) {
    for (j = 0; j < Ti; j++) {
        s[i] += a[j] * w[i][j];
    }
}
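A minimal sketch of how a circular NBout changes the tiled loop from earlier, under the simplifying (illustrative) assumption that all To partial sums fit in NBout at once: the psums accumulate on chip across input tiles and are written back to memory only once.

/* Tiled classifier with NBout as the on-chip psum store. Assumes,
 * for illustration, that all To psums fit in NBout, so partial sums
 * never round-trip to main memory between input tiles. */
void classifier_nbout(int Ti_total, int To, int Tt,
                      const float *a, const float *w,
                      float *nbout, float *s_mem) {
    for (int i = 0; i < To; i++) nbout[i] = 0.0f;
    for (int jj = 0; jj < Ti_total; jj += Tt)           /* input tiles */
        for (int i = 0; i < To; i++)
            for (int j = jj; j < jj + Tt; j++)
                nbout[i] += a[j] * w[i * Ti_total + j]; /* psum stays in NBout */
    for (int i = 0; i < To; i++)
        s_mem[i] = nbout[i];                            /* single write-back */
}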
41
NFU-1: Multipliers
▪ Convolutional layers require
multiplication
▪ 16-bit fixed point multipliers are
used instead of 32-bit floating
point multipliers
▪ Comments?
[Figure panels: Error, Performance]

42
NFU-2: Adders
▪ Adder trees
▪ Advantages of adder trees
– Fewer generate and propagate (carry) stages
– Lower latency (see the sketch below)
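A minimal behavioral sketch of a balanced adder tree, assuming n is a power of two: each pass of the outer loop models one tree level, so an n-input reduction takes log2(n) levels rather than n - 1 serial additions.

/* Balanced adder-tree reduction, in place. Assumes n is a power
 * of two; each 'stride' pass is one tree level, so the depth is
 * log2(n) instead of n - 1 sequential additions. */
int adder_tree(int x[], int n) {
    for (int stride = 1; stride < n; stride *= 2)   /* one tree level */
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];                  /* pairwise adds */
    return x[0];
}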

43
NFU-3: Non linear Unit
▪ Involves non-linear operations
– Exponentials, division, etc.
▪ Any optimization idea?
– Implement them as piecewise linear functions (see the sketch below)
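A minimal sketch of the piecewise linear idea applied to the sigmoid; the segment boundaries, slopes, and intercepts below are illustrative values chosen for continuity, not the coefficients used in DianNao (a real NFU-3 would look them up in a small on-chip table).

/* Piecewise linear approximation of sigmoid(x). The boundaries,
 * slopes, and intercepts are illustrative, not DianNao's actual
 * table; segment i computes slope[i]*x + icept[i]. */
float sigmoid_pwl(float x) {
    static const float bound[5] = {-4.0f, -2.0f, 0.0f, 2.0f, 4.0f};
    static const float slope[6] = {0.0f, 0.05f, 0.20f, 0.20f, 0.05f, 0.0f};
    static const float icept[6] = {0.0f, 0.20f, 0.50f, 0.50f, 0.80f, 1.0f};
    int i = 0;
    while (i < 5 && x >= bound[i]) i++;   /* pick the segment */
    return slope[i] * x + icept[i];
}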

44
Control Instructions

Control Instruction Format

An example: Tin = 8192, Tout = 256, 64-entry buffers
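(With the numbers above, and assuming one input per buffer entry, the Tin = 8192 inputs would be streamed through the 64-entry NBin as 8192 / 64 = 128 tiles.)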

45
Results

46

THANK YOU
