Lecture #5
28th January 2025
Previous Class
▪ Some more CNNs
▪ Techniques to reduce FLOPs
▪ Output stationary (OS) dataflow
2
Today’s Agenda
▪ Various dataflows for DNN execution
– Output stationary (an example)
– Weight stationary
– Input stationary
▪ Introduction to DNN accelerators
3
2-D Convolution – Output Stationary (worked example stepped over slides 4–10; figures only)
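In loop form, the output-stationary idea from the example above can be sketched roughly as follows (a 1-D illustration with assumed names E = number of outputs, R = filter size; stride 1, no padding):

/* Output stationary (sketch): the partial sum for one output stays in a
   local accumulator until it is complete, then is written back once.
   Assumes a 1-D convolution, stride 1, no padding: input length E + R - 1. */
void conv1d_os(const float *a, const float *w, float *s, int E, int R)
{
    for (int e = 0; e < E; e++) {        /* one output (psum) per PE */
        float psum = 0.0f;               /* the "stationary" value */
        for (int r = 0; r < R; r++)
            psum += a[e + r] * w[r];     /* inputs and weights stream past it */
        s[e] = psum;                     /* single write-back per output */
    }
}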
Different Types of Data Flow
▪ Output Stationary (OS)
▪ Weight Stationary (WS)
▪ Input Stationary (IS)
11
Weight Stationary (WS) Dataflow
▪ Broadcast activations and accumulate partial sums (psums) spatially across the PE array
▪ Weights are stored in the PE array
– Minimize weight-read energy consumption
– Maximize convolutional and filter reuse of weights
▪ Weight stationary
– Weights do not move (they stay in the PEs) until all computations that use them are finished (see the loop sketch below)
12
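As a rough loop-level counterpart to the 1-D example on the next slide (illustrative only; E = number of outputs and R = filter size are assumed names): the weight loop is outermost, so each weight is fetched once and reused across all E outputs while the psums accumulate in s[].

/* Weight stationary (sketch): each weight is fetched once and held while
   every output that needs it is updated; psums accumulate in s[]. */
void conv1d_ws(const float *a, const float *w, float *s, int E, int R)
{
    for (int e = 0; e < E; e++)
        s[e] = 0.0f;                     /* clear the partial sums */
    for (int r = 0; r < R; r++) {        /* weight loop is outermost */
        float wr = w[r];                 /* "stationary" in a PE register */
        for (int e = 0; e < E; e++)
            s[e] += a[e + r] * wr;       /* broadcast activations, accumulate psums */
    }
}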
1-D Convolution – Weight Stationary
13
Weight Stationary – Reference Pattern
▪ Observations
– Single weight is reused many times (E)
– Large sliding window of inputs (E)
– Fixed window of outputs (E)
14
Data Access for WS
15
Relation to WS Architecture
17
2-D Convolution – Weight Stationary (worked example, steps 2–6, slides 18–22; figures only)
WS Example: NVDLA (simplified)
▪ NVIDIA Deep Learning Accelerator (NVDLA)
(slides 23–24: figures only)
Different Types of Data Flow
▪ Output Stationary (OS)
▪ Weight Stationary (WS)
▪ Input Stationary (IS)
25
Input Stationary (IS) Dataflow
▪ Minimize activation-read energy consumption
– Inputs stay in the PEs until they are no longer needed
– Maximize convolutional and map reuse of activations
▪ Unicast weights and accumulate psums spatially across the PE array (see the loop sketch below)
26
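A matching loop-level sketch (again illustrative, same assumed names E and R): the activation loop is outermost, so each input is fetched once and reused for every output it overlaps, while weights are unicast and psums accumulate in s[].

/* Input stationary (sketch): each activation is fetched once and held while
   it contributes to every output it overlaps; weights are unicast per step. */
void conv1d_is(const float *a, const float *w, float *s, int E, int R)
{
    for (int e = 0; e < E; e++)
        s[e] = 0.0f;
    for (int i = 0; i < E + R - 1; i++) {   /* activation loop is outermost */
        float ai = a[i];                    /* "stationary" in a PE register */
        for (int r = 0; r < R; r++) {
            int e = i - r;                  /* output index this (input, weight) pair feeds */
            if (e >= 0 && e < E)
                s[e] += ai * w[r];
        }
    }
}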
IS Data Flow
27
Cost of Different Dataflows
28
Today’s Agenda
▪ Various dataflows for DNN execution
– Output stationary (an example)
– Weight stationary
– Input stationary
▪ Introduction to DNN accelerators
29
Accelerator: Why/What?
▪ Drawbacks of CPUs
– Limited parallelism
▪ Drawbacks of GPUs
– Limited memory
▪ Accelerators
– Application specific
– Tailored only for DNNs
– Mostly inference
▪ Major performance metrics
– Inference latency, energy, area
30
DianNao ("computer" in Chinese) [1]
▪ One of the earliest accelerators for DNN models
▪ Tiling to reduce memory accesses
– Classifier layer
– Convolutional layer
– Pooling layer
▪ An architecture tailored for DNN inference
31
General Structure of a DNN
▪ Sequential layers
▪ Feature maps
▪ Convolutional (with a nonlinearity applied to the output), pooling, and classifier layers
– Kernels are sometimes shared across outputs, sometimes private
– Pooling: average and max (a loop sketch follows below)
– Quiz: what are sx and sy in the figure?
– Classifier is a multi/single layer perceptron
32
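For reference, a max-pooling loop sketch; sx and sy are assumed here to be the window strides along x and y, and Kx, Ky, W, H are assumed names for the window and feature-map sizes (not from the slide):

/* Max pooling sketch for one feature map (no padding). sx, sy are assumed
   to be the window strides along x and y; Kx, Ky the window sizes;
   W, H the input width and height. */
void maxpool2d(const float *in, float *out, int W, int H,
               int Kx, int Ky, int sx, int sy)
{
    int Wo = (W - Kx) / sx + 1;              /* output width  */
    int Ho = (H - Ky) / sy + 1;              /* output height */
    for (int oy = 0; oy < Ho; oy++)
        for (int ox = 0; ox < Wo; ox++) {
            float m = in[(oy * sy) * W + ox * sx];
            for (int ky = 0; ky < Ky; ky++)
                for (int kx = 0; kx < Kx; kx++) {
                    float v = in[(oy * sy + ky) * W + (ox * sx + kx)];
                    if (v > m)
                        m = v;               /* average pooling would sum and divide instead */
                }
            out[oy * Wo + ox] = m;
        }
}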
Classifier Layer
▪ Notations
• Tin = number of input neurons
• Tout = number of output neurons
• a[j] = activation (value) at the jth input neuron
• w[i][j] = weight from the jth input neuron to the ith output neuron
• s[i] = sum at the ith output neuron
▪ Tin = 8, Tout = 2 in this example
(Figure: 8 input neurons fully connected to 2 output neurons, with labels a[j], w[i][j], s[i])
▪ Implementation
for (i=0; i<Tout; i++) {
  for (j=0; j<Tin; j++) {
    s[i] += a[j] * w[i][j];
  }
}
▪ #memory accesses?
33
Memory Accesses in the Classifier Layer
▪ Notations
• Tin = number of input neurons
• Tout = number of output neurons
• a[j] = activation (value) at the jth input neuron
• w[i][j] = weight from the jth input neuron to the ith output neuron
• s[i] = sum at the ith output neuron
for (i=0; i<Tout; i++) {
  for (j=0; j<Tin; j++) {
    s[i] += a[j] * w[i][j];
  }
}
▪ #memory accesses (worked numbers below)
– Activations: Tin × Tout
– Weights: Tin × Tout
– Sums: Tout
– Total: Tin × Tout + Tin × Tout + Tout
▪ Drawback
– The number of input neurons (Tin) can be very large (10 – 1L, i.e., up to about 10^5)
– The working set cannot fit in the L1 cache
34
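▪ As a rough worked example (using the layer size from the control-instruction slide later, Tin = 8192 and Tout = 256): the untiled loop makes 8192 × 256 + 8192 × 256 + 256 = 4,194,560 memory accesses, and the weight matrix alone has 8192 × 256 ≈ 2.1 M entries (≈ 4 MB at 2 bytes per weight), far more than a typical 32–64 KB L1 can hold.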
How to Reduce the Memory Accesses
▪ Rather, how to use the L1 cache?
▪ Tiling (a tiled loop sketch follows below)
– Divide the input neurons into tiles
– Store the tiles in L1
– Compute the contribution from each tile separately
(Figure: the classifier layer again, with labels a[i], w[i][j], s[j])
35
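A sketch of what the tiled loop could look like (Tt is an assumed tile-size parameter, and this is a software illustration rather than the exact DianNao schedule): one tile of activations is loaded into L1/NBin and reused by every output before the next tile is brought in.

/* Tiled classifier layer (sketch): the Tin inputs are processed Tt at a time,
   so one tile of activations (and the matching weight columns) can stay in
   L1/NBin while every output accumulates its contribution.
   w is the Tout x Tin weight matrix, row-major. */
void classifier_tiled(const float *a, const float *w, float *s,
                      int Tin, int Tout, int Tt)
{
    for (int i = 0; i < Tout; i++)
        s[i] = 0.0f;
    for (int jj = 0; jj < Tin; jj += Tt) {            /* one input tile at a time */
        int jend = (jj + Tt < Tin) ? jj + Tt : Tin;
        for (int i = 0; i < Tout; i++)                /* every output reuses the tile */
            for (int j = jj; j < jend; j++)
                s[i] += a[j] * w[i * Tin + j];        /* w[i][j] */
    }
}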
Tiling for Convolutional and Pooling Layers
▪ Tiling for convolutional layers (a loop sketch follows below)
– Input feature maps
– Output feature maps
▪ Pooling layers
– No kernels (why?)
– Tiling does not improve the BW requirement drastically
▪ Please go through the papers and come back with any question(s)
▪ Expect similar question(s) in the quiz
▪ Is this a good figure?
36
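A rough loop-nest sketch of feature-map tiling for a convolutional layer (illustrative only; Nin, Nout, Tn, Ti, K, W, H are assumed names and the blocking is not the exact DianNao schedule):

/* Convolution tiling sketch: block over output feature maps (tile Tn) and
   input feature maps (tile Ti) so that one block of kernels plus the matching
   input tile stays on-chip. Stride 1, no padding; kernels are K x K. */
void conv_tiled(const float *in, const float *k, float *out,
                int Nin, int Nout, int W, int H, int K, int Tn, int Ti)
{
    int Wo = W - K + 1, Ho = H - K + 1;
    for (int t = 0; t < Nout * Ho * Wo; t++)
        out[t] = 0.0f;
    for (int nn = 0; nn < Nout; nn += Tn)                     /* tile of output maps */
      for (int ii = 0; ii < Nin; ii += Ti)                    /* tile of input maps  */
        for (int n = nn; n < nn + Tn && n < Nout; n++)
          for (int i = ii; i < ii + Ti && i < Nin; i++)
            for (int y = 0; y < Ho; y++)
              for (int x = 0; x < Wo; x++)
                for (int ky = 0; ky < K; ky++)
                  for (int kx = 0; kx < K; kx++)
                    out[(n * Ho + y) * Wo + x] +=
                        in[(i * H + y + ky) * W + (x + kx)] * /* input map i */
                        k[((n * Nin + i) * K + ky) * K + kx]; /* kernel (n, i) */
}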
A Naïve Hardware Implementation
▪ Fully lay out the neurons and the synapses on the silicon
– Neurons: logic circuits performing the computation
– Synapses: latches/RAMs storing the weights
(Figure: the Tin = 8 classifier layer mapped directly to hardware, with labels w[i][j], s[j])
▪ Disadvantage
– Area increases drastically with the number of neurons
37
Performance of the Naïve Implementation
▪ Area and energy suffer significantly as the number of input and output neurons increases
38
Accelerator Architecture in DianNao
▪ Storage
– NBin: To store input activations
– SB: To store synapses
– NBout: To store outputs
▪ Neural functional unit (NFU)
– NFU-1: multipliers
– NFU-2: adder tree
– NFU-3: nonlinear function
▪ Control instructions
39
Storage (1)
▪ Split buffers: three different buffers instead of a single cache
– Reduced conflicts
– Reduced access time
▪ Scratchpad memory
▪ Point to be noted
– Different widths for NBin/NBout and SB enable reads of different data widths
for (i=0; i<To; i++) {
for (j=0; j<Ti; j++) {
s[i] += a[j] * w[i][j];
}
}
40
Storage (2)
▪ NBout is a circular buffer
– Partial sums are pushed into NBout instead of being sent back to main memory and fetched again
– An output is read from NBout only when all the input neurons have been integrated into its partial sum
– Useful while tiling
(Figure: Error / Performance results)
42
NFU-2: Adders
▪ Adder trees
▪ Advantages of adder trees
– Shorter carry generate/propagate chains on the critical path
– Lower latency: n products are reduced in about log2(n) adder stages instead of a serial chain (see the sketch below)
43
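A small software model of the reduction NFU-2 performs (a sketch; in hardware each tree level is a row of adders operating in parallel):

/* Adder-tree reduction (sketch): sum n products by adding pairs level by
   level, ceil(log2 n) levels in all, instead of a serial chain of n - 1
   dependent additions. Reduces p[] in place. */
float adder_tree(float *p, int n)
{
    while (n > 1) {
        int next = (n + 1) / 2;               /* elements surviving this level */
        for (int i = 0; i < n / 2; i++)
            p[i] = p[2 * i] + p[2 * i + 1];   /* one tree level (parallel adders in HW) */
        if (n % 2)
            p[n / 2] = p[n - 1];              /* odd element passes through */
        n = next;
    }
    return p[0];
}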
NFU-3: Nonlinear Unit
▪ Involves non-linear operations
– Exponentiation, division, etc. (e.g., in the sigmoid)
▪ Any optimization idea?
– Implement it as a piecewise linear function (see the sketch below)
44
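A minimal sketch of the piecewise-linear idea: segment breakpoints and (slope, intercept) pairs sit in a small table, so evaluating the nonlinearity costs one multiply and one add per activation (the struct and function names are illustrative, not DianNao's actual interface):

/* Piecewise-linear approximation (sketch): inside segment k the function is
   replaced by y = a_k * x + b_k, with breakpoints and coefficients kept in a
   small table instead of a divider/exponential unit. */
typedef struct {
    float x_hi;   /* segment is used while x < x_hi (last segment catches the rest) */
    float a, b;   /* slope and intercept for this segment */
} PwlSeg;

float pwl_eval(const PwlSeg *seg, int nseg, float x)
{
    int k = 0;
    while (k < nseg - 1 && x >= seg[k].x_hi)  /* find the segment containing x */
        k++;
    return seg[k].a * x + seg[k].b;           /* one multiply + one add */
}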
Control Instructions
An example: Tin = 8192, Tout = 256, 64-entry buffers
45
Results
46
THANK YOU